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REMARKS 

Reexamination and reconsideration in light of the foregoing amendment and following 
remarks is respectfully requested. 

Claims 1, 21, 23, 24, 26, 30-32 and 34-49 are pending in this application. Claims 2-20, 22, 

25, 27-29 and 33 have been canceled without prejudice or disclaimer. Claims 1, 21, 23, 24, 26 and 

30-32 have been amended and new claims 34-49 have been added. No new matter has been added 

to the application. Support for the amendments to the claims and for the new claims can be found in 

the specification as follows: 

Claim 1: The group insulin receptor (IR), IGF-1 receptor (IGF-1R) and insulin 
receptor related receptor (IRR): page 6, lines 14-17. 
Step (C) (i): page 7, lines 20-23. 
Step (C) (ii): page 17, lines 28-34. 

Claim 21 : The group IR, IGF-1R and IRR: page 6, lines 14-17. 

Claim 23: Original claim 23 and page 7, lines 20-23. 

Claim 24: Original claim 24, page 1 7, lines 28-34 and page 13, lines 1 5-20. 

Claim 26: Original claim 26, page 1 7, lines 28-34 and page 13, lines 1 5-20. 

Claim 34: Page 1 7, lines 28-34. 

Claim 35: Page 28, line 28-33. 

Claim 36: Page 6, lines 14-17 and page 28, lines 28-34. 

Claim 37: Page 29, lines 1-3. 

Claim 38: Page 6, lines 14-17 and page 29 lines 1-3. 

Claim 39: Original claim 4. 

Claim 40: Original claim 4 and page 7, lines 5-6. 
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Claim 41: 


Page 7, lines 2-5. 


Claim 42: 


Page 6, lines 14-17 and page 7, lines 2-5. 


Claim 43: 


Original claim 2. 


Claim 44: 


Original claim 13, page 17, lines 28-34 and page 13, lines 15-20. 


Claim 45: 


Original claim 14, page 17, lines 28-34 and page 13, lines 15-20. 


Claim 46: 


Original claim 15, page 6, lines 14-17. 


Claim 47: 


Original claim 1, page 7, lines 13-15 and page 7, lines 23-24. 


Claim 48: 


Page 7, lines 13-15. 


Claim 49: 


Page 7, lines 23-24. 



Applicants note the Examiner's consideration of the art cited in the Information Disclosure 
Statement filed May 26, 2000 as acknowledged in the Office Action Summary. Applicants further 
note the Examiner's acknowledgment of Applicant's claim of foreign priority under 35 U.S.C. §119 
and receipt of the certified priority document. 

Applicants want to thank the Examiner for granting the interview on June 24, 2003. The 
interview provided a greater understanding of the Examiner's position with respect to the following 
rejections. 

Rejection Under 35 U.S.C. S 101 
Claims 1-20 were rejected under 35 U.S.C. § 101 as being directed to non-statutory subject 
matter because "the only steps set forth in claims 1-20 are manipulation of data." Claims 2-20 have 
been canceled, thereby rendering the rejection moot as to these claims. Claim 1 has been amended 
to recite a method that is more than a manipulation of data. The claim as amended requires (i) 
assessing the stereochemical complementarity between a compound and a molecule, (ii) obtaining a 
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compound that possesses stereochemical complementarity to the molecule; and (iii) testing the 
compound. For the foregoing reason, it believed that the amendment to claim 1 overcomes the 
rejection and it is respectfully requested that the rejection be reconsidered and withdrawn. 
Rejection Under 35 U.S.C. § 112. First Paragraph 

Claims 1-33 were rejected under 35 U.S.C. § 112, first paragraph, for lack of enablement. 
The Examiner at the interview indicated that the rejection comports with the guidelines set forth in 
the Trilateral Project WM4 "Report on comparative study on protein 3 -dimensional (3-D) structure 
relating claims" and that the claims need to be restructured to overcome this rejection. 

Claims 2-20, 22, 25, 27-29 and 33 have been canceled, thereby rendering the rejection as to 
these claims moot. As for the remaining claims, claim 1 has been amended to recite a method of 
"identifying" a compound as opposed to "designing" or selecting a compound. Claim 1 has been 
further amended to require that the method is directed to identifying a compound that (i) modulates 
binding of a natural ligand to the insulin receptor (IR), IGF-1 receptor (IGF-1R) or insulin receptor 
related receptor (IRR) or (ii) modulates signal transduction via IR, IGF-1R or IRR. 

Applicants' position is that the level of skill of those working in the field of in silico 
screening at around the priority date of the present application (i.e., around November 1997) was 
relatively high. More specifically, the average capabilities of those working in this field included 
the ability to identify candidate binding pockets within any given 3D structure using standard 
methodologies. 

Computer algorithms that may be used for this purpose, include, for example, PASS 
(evidence: Brady, G.P., Jr. et al., "Fast prediction and visualization of protein binding pockets with 
PASS," J. Comput Aided Mol. Des., vol. 14, 383-401 (2001); copy attached as Exhibit A). The 
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PASS algorithm involves coating the surface of the protein structure model with sets of probe 
spheres, retaining those with low solvent accessibility and identifying some of these as likely 
centers of binding pockets. A person skilled in this field would have been fully familiar with the 
implementation of a range of docking programs (such as those listed in the patent application) to 
screen for candidate binding ligands. Evidence of the techniques that would be well within the 
capabilities of those skilled in this field are described in the following publications: 

(1) Li et al., "Structure-based design of parasitic protease inhibitors," Bioorg Med 
Chem, 1996 Sep, 4(9):1421-7. Copy attached as Exhibit B. 

(2) Ring et al., "Structure-based inhibitor design by using protein models for the 
development of antiparasitic agents," Proc Natl Acad Sci USA, 1993 Apr 15, 90(8):3583-7. Copy 
attached as Exhibit C. 

(3) Li et al., "Anti-malarial drug development using models of enzyme structure," Chem 
Biol, 1994 Sep, l(l):31-7. Copy attached as Exhibit D. 

As mentioned above, Applicants submit that any competent researcher working in the field 
of in silico screening would be able to identify candidate binding pockets in any given 3D structure. 
In the present case, however, the patent application actually identifies specific "topographic regions" 
which represent preferred "binding pockets" within the IGF-1R structure. These binding pockets 
can be used in screening methods to identify potential ligands. For example, the fragment which 
includes residues 191-290 forms part of the cys-rich region of the ectodomain of IGF-1R. This 
region is important in determining ligand binding specificity. The specificity determinant can be 
further limited to residues 223-274 (see page 28, line 12 to page 29, line 15). 
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The patent application provides further guidance for selecting regions within the identified 
binding pockets at page 6, line 26 to page 7, line 12. For example, it is stated that the ligand may 
interact with (i) a region of the LI domain-cys-rich interface, thereby causing an alteration in the 
positions of the domains relative to each other; (ii) a hinge region between the L2 domain and the 
cys-rich domain causing an alteration in the positions of these domains relative to each other; or (iii) 
the P-sheet of the LI domain causing an alteration in the position of the LI domain relative to the 
position of the cys-rich domain or L2 domain. 

The patent application goes even further by specifying two sites on the lower P-sheet of the 
LI and L2 domains as suitable targets for screening. See, for example, the specification at page 6, 
lines 9-14. 

Accordingly, the patent application not only identifies the binding pockets within the IGF- 
1R structure, but also suggests preferred regions within these binding pockets to use in screening for 
ligands. 

Armed with the atomic coordinates of the IGF-1R provided in the patent application and the 
information regarding preferred regions within specified binding pockets, it would have been a 
matter of routine for a person skilled in the area of in silico screening to utilize any one of the well 
known docking programs to screen for potential ligands. 

On page 3 of the Office Action, the Examiner states that it is unknown and cannot be 
predicted from the information presented in the specification what degree of stereochemical 
complementarity is required. Stereochemical complementarity between a chemical compound and 
the target protein structure is a cumulative effect of the hydrogen bonds, favorable electrostatic 
interactions, and favorable van der Waals contacts between the compound molecule and the protein 
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molecule. Depending on the nature of the compound molecule, one factor may predominate over 
others in contributing to the overall complementarity. Those skilled in the art can visually examine 
on a computer graphics monitor a compound molecule docked into the binding site of the receptor 
and assess the source of the complementarity. All docking programs have scoring functions which 
are used to dock and then rank the molecules with a score indicating how well a particular chemical 
compound molecule binds to the receptor. 

The top-ranking compounds can then be further assessed visually and computationally. For 
example, a computer program such as XSCORE (evidence: Wang, R. et al., "Further development 
and validation of empirical scoring functions for structure-based binding affinity prediction," J. 
Compu Aided Mol. Design, Vol 16, 11-26(2002); copy attached as Exhibit E) has a scoring function 
which predicts the dissociation constant for a given ligand-protein complex structure (for example, 
the docked compound-receptor complex). This scoring function was derived by fitting the function 
to the experimentally determined dissociation constants of a set of 200 ligand-protein complexes. 
As a general rule, stereochemical complementarity is not discussed in terms of degree. In silico 
screening requires certain parameters to be set to determine whether or not any given molecule will 
register as a stereochemical "fit" with the binding site of interest. A person of skill in this area 
would be able to set appropriate parameters through trial and error to select for suitable 
stereochemical complementarity to a binding site described in the patent application. 

On page 3 of the Office Action, the Examiner points out that claim 1 requires that the 
selected compound bind to any molecule of the insulin receptor family and modulate any activity 
mediated by the molecule. Claim 1 has been amended to make it clear that the compound is tested 
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for its ability to either modulate binding of a natural ligand to or to modulate signal transduction via 
a member of the insulin family selected from IGF-1R, IR and IRR. 

For all of the foregoing reasons, the specification provides sufficient guidance that a person 
having ordinary skill in the art would have been able to practice the invention without undue 
experimentation. Accordingly, claims 1, 21, 23, 24, 26, 30-32 satisfy the requirement of 35 U.S.C. 
§ 1 12, first paragraph. It is respectfully requested that the rejection be reconsidered and withdrawn. 
Rejection Under 35 U.S.C. § 112. Second Paragraph 

Claims 1-33 were rejected under 35 U.S.C. § 112, second paragraph, as being indefinite. 
The Examiner objected to language specifically recited in claims 1, 3-6, 16-20 and 21. Claims 2-20 
have been canceled, thereby rendering any rejection under 35 U.S.C. § 112, second paragraph, as to 
these claims moot. 

In claim 1, the Examiner objected to the phrase "assessing the stereochemical 
complementarity between the compound and receptor site of the molecule" in that she did not know 
"what delimits a topographical region." The phrase has been deleted from the claim. It is believed 
that by this amendment, the rejection is overcome. 

Also in claim 1, the Examiner objected to the phrases "substantially as shown" and "forms 
an equivalent." With respect to the term "substantially," Applicants submit that a person skilled in 
the art would have understood that the coordinates set out in Fig. 1 need not be strictly adhered to in 
order to generate a three dimensional structure in silico for screening for ligands of the IR, IGF-1R 
or IRR. See Section 2173.05(b) of MPEP, and in particular, the discussion on the term 
"substantially." This Section refers to a decision in which the phrase "which produces substantially 
equal E and H plane illumination patterns" was considered definite because one of ordinary skill in 
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the art would know what is meant by "substantially equal". Andrew Corp v Gabriel Electronics, 
847 F.2d 819, 6 USPQ2d 2010 (Fed. Cir. 1988). In the present case, given the nature of the 
invention and the experience of those skilled in the art of in silico screening, the phrase 
"substantially as shown in Figure 1" in relation to coordinates would have been clearly understood. 
Applicants also point out that this phrase is present in a method claim which involves numerous 
steps including obtaining a compound with requisite stereochemical complementarity and testing 
the compound for its ability to modulate binding of a natural ligand to or signal transduction via the 
IR, IGF-1R or IRR. Within the context of this screening process, a person skilled in the art would 
have understood the flexibility in variation from the exact coordinates shown in Fig. 1 which would 
allow generation of a structure with sufficient identity to the IGF-1R receptor coordinates listed in 
Fig. 1 to allow screening for ligands. 

With respect to the phrase "forms an equivalent" in claim 1, this phrase has now been 
amended to refer to an amino acid sequence of IR or IRR that forms an equivalent structure to that 
formed by amino acids 1-462 or IGF-1R. In light of the information provided in the specification, 
and in particular, the sequence alignment information provided in Fig. 9, a person skilled in the art 
would have clearly understood what is meant by this phrase. 

With regard to claim 21, the Examiner found the phrase "which are structurally similar to a 
portion" to be indefinite because "it is unknown what degree of similarity is required to meet this 
limitation." It is Applicants' position that the phrase would have been clearly understood by a 
person skilled in the art. In the context of the claim, it would have been clear to such a person that 
the selected chemical structures must be similar to a sufficient portion of the criteria data set to 
allow binding of the actual compound to a member of the insulin receptor family. 

15 



Application No. 09/555,275 



For all of the foregoing reasons, it is respectfully requested that the rejection of claims 1 and 
21 under 35 U.S. C. § 112, second paragraph, be reconsidered and withdrawn. 

Rejection Under 35 U.S.C. § 103fa) 

Claims 1-20 and 29-34 were rejected under 35 U.S.C. § 103(a) as being unpatentable over 
Hendry et al. (U.S. Patent No. 5,705,335).' Claims 2-20, 29 and 33 have been canceled, thereby 
rendering the rejection as to these claims moot. 

The Examiner finds that the claims are obvious in light of Hendry et al. This reference 
relates to a computer based method for creating a pharmacophore which involves determining the 
optimal fit of compounds into nucleic acid sequences such that the lowest energy of interaction and 
best steric fit are obtained. In support of this rejection, the Examiner refers to the Trilateral Project 
WM4 "Report on comparative studies on protein 3-dimensional (3-D) structure related claims." 
This report states that if the difference between the prior art and the claimed invention as a whole is 
limited to descriptive material stored on or employed by a machine, it is necessary to determine 
whether the descriptive material is functionally descriptive or non-functionally descriptive material. 

Applicants submit that the difference between amended claim 1 and the prior art is not 
merely limited to descriptive material stored on or employed by a machine. In particular, claim 1 as 
proposed to be amended not only involves the step of assessing stereochemical complementarity 
between a compound and the 3-dimensional structure of a molecule of the insulin receptor family, 



1 The inclusion of claim 34 in this rejection is in error since as of the date of the Office Action, there were only 33 
claims. It is believed that the Examiner intended the rejection to encompass claims 1 -20 and 29-33. 
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but also involves the step (B) of obtaining a compound which possesses stereochemical 
complementarity to the molecule; and step (C) of testing the compound for its ability to (i) modulate 
binding of a natural ligand to the IR, IGF-1R or IRR, or (ii) modulate signal transduction via the IR, 
IGF-lRor IRR. 

Steps (B) and (C) of claim 1 are clearly not merely descriptive material stored on or 
employed by a machine. These steps involve physical testing of compounds for their ability to 
modulate a specified activity of a member of the insulin receptor family. Hendry et al. do not 
disclose or suggest a method of screening for a compound which binds to a molecule of the insulin 
receptor family followed by testing of compounds identified for their ability to modulate either 
binding of natural ligands or signal transduction via a member of the insulin receptor family. 

Claims 30 and 32 have been amended to be dependent on claim 1 while claim 31 has been 
amended to be further dependent on claim 30. It is respectfully submitted, therefore, that claims 1 
and 30-32 as amended are clearly novel and non-obvious over Hendry et al. It is requested that the 
rejection be reconsidered and withdrawn. 

New Claims 

New claims 34-49 are presented for examination. Claims 34-46 are dependent on base 
claim 1. For the reasons set forth above for patentability of claim 1, it is believed that new claims 
34-46 are allowable. New claim 47, and its dependent claims, claims 48 and 49, are directed to the 
method steps (A) and (B) as in claim 1, but also require in step (C) "selecting a compound that has a 
Kb or Ki of less than 10" 6 M for IR, IGF-1R or IRR." Hendry et al. does not teach or suggest step 
(C) as required by claim 47. For all of the foregoing reasons, it is believed that claims 34-49 are 
patentable. 
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Conclusion 

It is submitted that the claims 1, 21, 23, 24, 26, 30-32 and 34-49 are patentable over the 
teachings of the prior art relied upon by the Examiner as well as comply with the requirements of 35 
U.S.C. § 112, first and second paragraphs. Accordingly, favorable reconsideration of the claims is 
requested in light of the preceding amendments and remarks. Allowance of the claims is 
courteously solicited. 

To the extent necessary, a petition for a three-month extension of time under 37 C.F.R. 1.136 
is hereby made. Please charge any shortage in fees due in connection with the filing of this paper, 
including extension of time fees, to Deposit Account 500417 and please credit any excess fees to 
such deposit account. 

Respectfully submitted, 
McDERMOTT, WILL & EMERY 




Cameron K. Weiffenbach 
Registration No. 44,488 

600 13 th Street, N.W. 
Washington, DC 20005-3096 
(202) 756-8000 CKW:jdj 
Facsimile: (202) 756-8087 
Date: August 6, 2003 
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APPENDIX 

Replacement sheets of the drawings and annotated sheets showing changes in original Figs. 
1 and 9 are attached hereto. 
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Summary 



PASS (Putative Active Sites with Spheres) is a simple computational tool that 
uses geometry to characterize regions of buried volume in proteins and to identify 
positions likely to represent binding sites based upon the size, shape, and burial extent 
of these volumes. PASS'S utility as a predictive tool for binding site identification is 
tested by predicting known binding sites of proteins in the PDB using both complexed 
macromolecules and their corresponding apo-protein structures. The results indicate 
that PASS can serve as a front-end to fast docking. The main utility of PASS lies in the 
fact that it can analyze a moderate-size protein (~ 30 kD) in under twenty seconds, 
which makes it suitable for interactive molecular modeling, protein database analysis, 
and aggressive virtual screening efforts. As a modeling tool, PASS (i) rapidly identifies 
favorable regions of the protein surface, (ii) simplifies visualization of residues 
modulating binding in these regions, and (iii) provides a means of directly visualizing 
buried volume, which is often inferred indirectly from curvature in a surface 
representation. PASS produces output in the form of standard PDB files, which are 
suitable for any modeling package, and provides script files to simplify visualization in 
Cerius2®, Insightll®, MOE®, Quanta®, RasMol®, and Sybyl®. PASS is freely available to 
all. 

Keywords: protein active site, binding site, cavity detection, buried volume, molecular 
modeling, computer-aided drug design 
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Introduction 



The identification and visualization of protein cavities is the starting point for 
many structure-based drug design (SBDD) applications. Sites of activity in proteins 
usually lie in cavities, where the binding of a substrate typically serves as a mechanism 
for triggering some event, such as a chemical modification or conformational change. 
Consequently, binding sites are often targeted in attempts to interrupt molecular 
processes via therapeutics. Although binding site locations are often furnished by x-ray 
data or fold recognition, tools that automatically predict these locations have become 
quite popular in SBDD, especially as front-ends to molecular docking or when alternate 
binding sites are sought [1,2]. The size and shape of protein cavities dictates the 
three-dimensional geometry of ligands that can strongly bind there; i.e. they must fit like 
a hand in glove. Thus, a minimal requirement for drug activity is that the molecule 
sterically fit the region of buried volume inscribing the active site cavity, with some 
allowance for induced fit. The determination and visualization of these volumes is 
critical in drug design, particularly since manual intervention is still fruitfully employed in 
most design scenarios. An ordinary stick representation of a protein, unfortunately, 
provides little insight regarding the location, shape, or size of its buried volumes. While 
surface representations [3, 4] are a step in the right direction, they still fall short in that 
they require the user to infer buried volumes from often-occluded void space. 
Consequently, methods for direct display of regions of buried volume in proteins have 
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become prevalent in recent years [5-11]. Moreover, as molecular docking and virtual 
screening become more predictive and prevalent, the possibility of interfacing such 
tqols with functional genomics via threading or homology modeling becomes 
increasingly tempting. A versatile tool that can rapidly predict binding sites should, 
therefore, find a niche as a front-end to such automated screening efforts. This paper 
describes a program called PASS (Putative Active Sites with Spheres), which may 
serve both as an interface to virtual screening and as a visualization aid for manual 
molecular modeling. 

Methods 

The PASS algorithm is designed to fill the cavities in a protein structure with a 
set of spheres and to identify a few of these spheres (called "active site points", ASPs) 
that most likely represent the centers of binding pockets. Crevice filling is performed in 
layers using three-point Connolly-like [3] sphere geometry. An initial coating of probe 
spheres is calculated with the protein as substrate, then additional layers of probes are 
accreted onto the previously found probe spheres. Only probes with low solvent 
exposure are retained, and the routine finishes when an accretion layer produces no 
new buried probe spheres. Although physical arguments can be made to substantiate 
PASS'S success in binding site prediction, the algorithm itself is purely geometrical (see 
Figure 1). 
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Calculation of Probe Spheres 

PASS begins by reading the Protein Data Bank (PDB) coordinates of a target 
protein and assigning elemental atomic radii (Table 1). Since a protein with explicitly 
represented hydrogen atoms contains less interstitial volume than one without 
hydrogen, PASS assigns a few different parameter values in the two cases. By default, 
if less that 20% of the atoms in the protein PDB file are hydrogen, then all hydrogen 
atoms are removed and hydrogen-free parameters are assigned; otherwise, hydrogen 
is retained and hydrogen-inclusive parameters are assigned (Table 1). The first layer of 
probe spheres is computed by looping over all unique triplets of protein atoms and, if 
they are close enough together, calculating the two locations at which a probe sphere 
(of radius R pro J may lie tangential to all three protein atoms (Fig. 1 ; Step a). Appendix 
A elucidates this three-point geometry, which is nontrivial since the radii are not 
necessarily equal. To be retained, a putative probe sphere must survive several filters 
(Fig. 1 ; Step b). The first condition is that it cannot overlap with any atoms of the 
accretion substrate. The second filter explicitly prohibits the probe from clashing with 
any protein atoms, while the third ensures that the probe be somewhat buried within the 
protein (i.e. in a binding-site-like region). In particular, each probe sphere is ascribed a 
"burial count" (BC) representing the extent to which it is excluded from solvent (Figure 
2). The BC of a probe is computed by counting the number of protein atoms that lie 
within a radius R BC =8A of it, and the probes are filtered such that any probe sphere with 
BC less than a threshold value (BC mr8ShoUJ ) is rejected. This threshold value was 
determined empirically, as were many of the PASS parameters, by visual inspection of 
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results for a few test systems. Our experience has been that PASS'S predictions are 
largely insensitive to the precise values of any of its parameters. Finally, probe spheres 
are "weeded" such that no two probe centers lie any closer together than R weed = 1A. 
This keeps the distribution of probe spheres from becoming clumped, which enables 
reliable prediction of active site points from the final set of probes. 



Table 1 • PASS Parameters 



Parameter 


Rpro*. hydrogen-free 


1.8A 


BC lhrMhoia hydrogen-free 


55 


Rprc with hydrogen 


1.5 A 


BC lhrMteM with hydrogen 


75 


R BC 


8.0 A 


Rwd 


1.0 A 


D 

accretion 


0.7 A 


R 0 


2.0 A 


D„ 


1.0A 


p 


8.0 A 


PW 


1100 


Elemental Radii [40] 


Hydrogen 


1.20 A 


Oxygen 


1.52 A ' 


Nitrogen 


1.55 A 


Carbon 


1.70 A 


Sulfur 


1.80 A 



Values of PASS parameters, which are defined 
as follows. R prob() - Radius of a probe sphere. 
BC mr.B,oid - Threshold burial count (BC) 
distinguishing a buried probe from an exposed 
one. Rg C - Radius used to compute burial 
counts. R^ - Minimal separation between 
probe spheres. R accrl0o „ - Radius of probes as 
they are accreted onto existing probes. R 0 , D 0 - 
Parameters defining the probe weight (PW) 
envelope function (see Fig. 2). R ASP - Minimal 
distance between active site points (ASPs). 
PW m ,„ - Minimal PW for an ASP. 
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After the seminal layer of probes is computed, additional layers of spheres are 
iteratively accreted onto the existing probe spheres. At each iteration, a set of new 
probe spheres is computed as described above (Fig. 1 ; Steps c,e), but with a smaller 
probe radius (R accnstion = 0.7 A) and with the set of all probe spheres retained from 
previous layers as the accretion substrate. New probes, however, must still maintain a 
center-to-center distance of at least R probe + o, from each protein atom, i (of radius o). 
The aforementioned filters are imposed when the newly-found spheres are combined 
with those retained from previous layers (Fig. 1; Step d). PASS continues the accretion 
phase until a layer is encountered in which none of the newly-found probe spheres 
survives the filters (Fig. 1; Step 0- The result of this procedure is that the cavities, 
invaginations, and internal voids in the protein are filled with a set of fairly evenly- 
spaced probe spheres, all of which are buried and none of which sterically clashes with 
the protein. Furthermore, probes lying along the protein surface are packed in ideal 
steric contact with three protein atoms. 

Active Site Point (ASP) Determination 

PASS subsequently identifies a small number of "active site points" (ASP) from 
amongst the final set of probe spheres (Fig. 1; Step g). These ASPs are meant to 
represent potential binding sites (i.e. centers of putative active sites) for ligands of 
arbitrary shape and polar character. Thus, PASS conservatively views a protein 
binding site as simply an invagination in the protein surface that is large enough to 

accomodate a iigand and possesses substantial solvent-excluded volume in which 
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hydrophobic ligand moieties may be buried. ASPs are accordingly selected by 
identifying the centraf probes in regions that contain many spheres with high BCs. In 
particular, each probe is assigned a "probe weight" (PW), which is proportional to the 
number of probe spheres in the vicinity and the extent to which they are buried. The 

Nprobes , * 

probe weight of the i m probe is given by PW(i)z £ BC(j) /(|r,. -r y |), where the 

envelope function, f(r), is shown in Figure 2. This is conceptually similar to the 
solvation term of Stouten et al. [12], the premise of which is that the solvation energy of 
an atom varies linearly with its exposure which, in turn, is proportional to the 
unoccupied volume around it. The final ASPs are determined by cycling through the 
probes in descending order of PW, keeping only those with PW > PW min (=1 100) that 

are separated by a minimum distance R ASP (= 8A) from the ASPs already identified. 
Finally, the set of ASPs is rank-ordered according to PW values. These are PASS'S 
predicted binding sites. 

PASS Output 

The default PASS output consists of (i) a PDB file containing the final set of 
probe spheres, (ii) a PDB file of the ASPs, and (iii) a separate PDB file for each ligand 
that was optionally read in (see below). By default, PASS "smoothes" the probe 
spheres before writing the final set of "display" probes to a PDB file. In particular, only 
probes with at least 4 display probes lying within 2.5A are written to file by default. 
Smoothing removes all but appreciable groupings of probe spheres, leaving the final 
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visualization less cluttered. Smoothing can be suppressed via the command-line flag [- 
all]. PASS also produces visualization scripts for several popular molecular modeling 
packages; namely, Cerius2*[13], lnsightll*[14], MOE*[15], Quanta®[16], RasMol®[17], 
and Sybyl®[18]. These scripts, which are optionally produced via command-line flags 
(e.g. [-Insightll]), simplify visualization by automatically loading, rendering, and coloring 
the protein, probe spheres, ASPs, and ligands. PASS also displays detailed runtime 
information, including parameter settings, an account of sphere calculation and filtering 
(e.g. Table 2), and final probe sphere and ASP data, including BCs and PWs. PASS 
can also read the coordinates of bound ligands, either automatically from the protein 
PDB file (as HETATM entries with different residue names), or as separate files via the 
command-line flag [-ligand <filename.pdb>]. For each ligand, PASS computes the 
distance from each ASP to the nearest ligand atom and to the ligand center of mass. 
Other command-line options enable the user to (i) produce an enhanced set of probe 
spheres and ASPs ([-more]), (ii) repress production of the probe sphere PDB file ([- 
noprobes]), (iii) treat water molecules as part of the protein ([-water]), rather than 
ignoring them (which is the default behavior), (iv) specify an explicit output path ([-outdir 
<directory_path>]), (v) produce a set of PDB files containing subsets of the final probe 
spheres that were produced in the various layers of sphere calculation ([-layers]), and 
(vi) compute the volumes of all groupings of probe spheres left after smoothing ([- 
volume]). None of these options slows PASS noticeably except the volume calculation, 
which proceeds as follows. After probe smoothing, the final set of display probes is 
agglomeratively clustered [19] by iteratively merging pairs of overlapping groups of 
probes until an iteration attempts to join two non-overlapping clusters. This determines 
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both the optimal number of probe groups and the identities of spheres in these groups. 
Group volumes are subsequently computed by looping over probe spheres and 
estimating the volume increments statistically. If ligand(s) are present, distances are 
computed from the center of each group (i.e. the cluster center) to (i) the nearest ligand 
atom (D near ), and (ii) the ligand center of mass (D C0M ). 

Results 

Table 2 shows the numbers of probe spheres retained at various stages of a 
PASS calculation on thermolysin (1hyt) and is meant to provide an impression of the 
practical operation of the algorithm. In layer #1 of the probe sphere calculation, the 
protein atoms constitute the accretion substrate, and every set of three protein atoms 
lying close enough together to be simultaneously touched by a single sphere (of radius 
R prob J must be identified and used to determine two putative probe sphere positions. 
The number of atomic triples that must be tried is reduced by first identifying atomic 
neighborhoods. The "neighborhood" of atom "i" is the set of atoms lying close enough 
to "i" to be bridged by a single probe sphere. In layer #1 , 769,205 triples of protein 
atoms satisfied the neighborhood criterion, and 1,154,010 "bridging spheres" were 
located using these triplets. The number of bridging spheres is less than twice the 
number of atomic triples because not all triples of atoms in the accretion substrate that 
satisfy the neighborhood criterion can actually be bridged by a sphere of radius R probe . 
The set of bridging spheres is then filtered according to (i) clash with the accretion 
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substrate, (ii) clash with the protein, (iii) burial count, and (iv) proximity to other probe 
spheres, in that order. After the substrate clash filter, 2,151 putative probe spheres 
remain and, since the protein is the accretion substrate in layer #1, the same number 
remains after the protein clash filter. All but 81 1 putative probes are discarded based 
upon insufficient burial, and 360 remain after these 81 1 are "weeded" to maintain a 
mutual separation of at most ^^=1 .0 A. Thus, 360 probe spheres are found in the 
first layer. The accretion substrate for the second and subsequent "accretion" layers is 
the set of probe spheres. In layer #2, the substrate of 360 probe spheres requires that 
384 substrate triples be tested, from which 560 bridging spheres are identified. After 
applying the four filters, only 60 new probe spheres remain, bringing the total number of 
probes to 420 after layer #2. This process is repeated until layer #7, in which no new 
probe spheres are identified, signalling the completion of probe sphere determination. 
Note that although the number of probe spheres continually grows as accretion 
procedes, the number of accretion substrate triples that must be tried in each layer 
plateaus. This is because PASS is written such that only triples of substrate atoms 
incorporating a newly-found probe sphere (or the neighbor of a freshly-weeded probe) 
are tried. As a result, PASS'S performance scales favorably with protein size 
(approximately MW 3 * over the molecular weight range in Table 3). 

PASS was first tested for its ability to identify known binding sites. Table 3 
shows the results of applying PASS to 30 protein-ligand complexes drawn from the 
PDB. The structures were chosen based upon diversity, resolution, inclusion in 
previous theoretical studies, and the existence of corresponding apo-protein x-ray 
structures in the PDB. In each case, hydrogen-free PASS parameters were assigned 
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and bound water molecules were ignored. For each PDB complex, Table 3 shows the 
number of layers of probes PASS computed prior to convergence, the final number of 
probe spheres, the number of ASPs identified for each protein structure, and the 
required CPU time. Coordinates of the known ligand(s) are used to define a binding 
site "hit." In particular, for each ASP of a particular protein, two quantities are 
computed: (i) D N9ar , the distance from the ASP to the nearest ligand atom, and (ii) D C0M , 
the distance from the ASP to the ligand center of mass (COM). Any ASP with D Near < 4A 
is considered a binding site "hit." The Binding Site Hits column lists the rank order of 
the ASP(s) that are considered hits, and the values in the D u and D™, columns 

IN 03" COM 

correspond to these hits. For instance, the "1hvr" row in Table 3 indicates that both the 
top ASP and the second-ranked ASP lie near the site in HIV-1 protease known to bind 
XK263. In particular, the top ASP lies 1 .2 A from the nearest XK263 atom and 2.3 A 
from the COM, while the second-ranked ASP lies 0.8 A from the nearest atom and 6.3 
A from the COM. Note that ligand size impacts the D C0M values, as evidenced by the 
trypsin-PTI system, which has the largest ligand (a protein) and, correspondingly, the 
largest D C0M values (- 19 A). 

Table 3 shows that PASS is able to successfully identify the locations of known 
binding sites in complexed x-ray structures. PASS located the pocket containing a 
known ligand in all but three of the 32 trials, often finding multiple binding site hits for a 
given ligand (1 1 times). In addition, the top-ranking ASP identified by PASS represents 
a binding site hit in 19 of the 32 trials, and one of the top three ASPs is a hit in 26 trials. 
These observations indicate that PASS can usually identify the protein cavity to which a 
ligand will bind with maximal affinity in a matter of seconds. There is a strong, but not 
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perfect, correlation between ASP rank (i.e. PW) and the volume of the corresponding 
group of probe spheres. In fact, volume is approximately as predictive of binding sites 
(results not shown) as ASP rank for the systems in Table 3. However, the calculation 
of volumes slows PASS noticeably for systems requiring many probe spheres (e.g. 92, 
40, and 24 seconds for 1jst, 3aah, and 1etr, respectively). 

From a drug design perspective, the analysis presented in Table 3 is somewhat 
immaterial, since the existence of complexed coordinates implies that at least one 
binding site location is already known. Intuition suggests that the presence of a ligand 
in a complex might induce a more pronounced binding site cavity than would be 
present in an apo-protein structure, thereby biasing a cavity-detection algorithm like 
PASS to succeed on complexed systems. Thus, the postdiction of binding sites in PDB. 
complexes does not establish the predictive utility of a tool for drug design, where one 
is lucky to have an apo x-ray structure or reliable homology model. 

A more realistic test of PASS as 'a tool for prediction is to try to locate known 
binding sites on the structures of proteins that are not complexed with a ligand. We 
address this predictability issue by using PASS to compute ASPs for the set of apo- 
protein structures from the PDB that correspond to complexed PDB structures in Table 
3. Apo structures were identified for as many of the systems in Table 3 as possible 
(20), and default PASS parameters were used in all calculations. A few of these PDB 
correspondences are not identical residue-by-residue because the molecules either 
were obtained from different sources (1npc/1hyt; 2apr/2er6), had residue additions or 
deletions at the termini (1swb/1stp; 1hxf/1dwd), or had incomplete or missing residues 
due to poor electron density (5dfr/4dfr; 1hxf/1dwd). For comparison, the results 
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displayed in Table 4 are presented in the same order as in Table 3, and corresponding 
PDB codes are shown. "Known" binding site positions are determined by superposing 
tfje native and complexed structures and computing the proximity of the ASPs (from the 
native PASS calculation) to the known ligand (from the complexed crystal structure). 
This enables binding site "hits" to be computed as in Table 3, along with the distances 
D Near and D C0M relating the position of the known ligand to the binding site hits. Only 
backbone atoms {C,0,C o ,N} were superposed and, in all but a few cases (see Table 4 
caption), all residues in the chain were used. To quantify how severely the ligand 
deforms the protein in the binding site, we computed the RMSD between superposed 
structures using only residues lying in this region. In particular, we identified both the 
set {C} of residues lying within 4 A of the ligand in the complex and the set {A} of 
corresponding residues in the superposed apo structure. The RMSD between {CJ and 
{AJ was then computed, using both side chain and backbone atoms for identical amino 
acids and only the backbone atoms otherwise. 

Table 4 shows that PASS can reliably predict binding site locations when only an 
apo x-ray structure is known. PASS correctly identifies the binding site in 17 of the 21 
trials in Table 4. The top-ranked ASP hits the binding site in 12 trials, and one of the 
top three ASPs is a hit in 16 trials. These observations imply that PASS may be a 
suitable front-end to virtual high throughput screening and fast docking routines. 
Furthermore, the similarity of observed hit rates between the apo-protein and 
complexed systems refutes the hypothesis that the presence of a ligand in the 
structural data is a crucial determinant of success for a cavity detection algorithm. 
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One additional option available in PASS is the generation of an enhanced set of 
probes and ASPs by running PASS in "more" mode via the [-more] command-line flag. 
Iq "more" mode, the burial count threshold is slightly reduced (by 10), which typically 
has the effect of enhancing the number of probe spheres by about a factor of two and 
ASPs by a factor of two or three, at the expense of about 20-30% in cpu time. When 
the systems in Tables 3 and 4 are analyzed in "more" mode, the binding site is detected 
in every case, with no ASP hit ranking worse than ninth. Tables 3 and 4 show (in 
parentheses) the ASP hits obtained in "more" mode for the few binding sites that the 
default PASS calculation failed to locate. Detailed inspection revealed that several of 
these default-mode misses contained an accumulation of probe spheres that fell just 
beneath the threshold defining an ASP. Running PASS in "more" mode is suggested 
when broad binding sites are anticipated (e.g. protein-protein association). 

The work of Mattos and Ringe [1 , 20] constitutes the experimental analog of 
PASS and enables the most direct comparison of PASS to experimental data. In 
particular, Mattos and Ringe have soaked elastase crystals with a variety of small 
organic solvents and crystallographically determined the corresponding protein 
structures, including bound solvent molecules. These bound organic probes are meant 
to map out potential binding hot spots on the protein and suggest favorable ligand 
moieties. This raises the question of whether their organic probes tend to cluster in 
regions identified via PASS ASPs, which are likewise meant to identify possible hot 
spots. To address this, PASS was run on elastase and the resulting ASPs were 
graphically superimposed with Ringe et al.'s organic probes, along with a set of bound 
ligands drawn from the PDB. Figure 3 shows these results. Several clusters of organic 
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probes are observed, most notably a large grouping in the active site (S1 pocket). 
Although only one organic probe lies within 8A of the top- or second-ranked ASPs, 
PASS places an ASP near four of the five largest clusters of probes. The inset to 
Figure 3 shows that the third-ranked ASP (pale blue) lies in the active site about 5A 
above the catalytic triad (whose surface is colored green). 

Figure 3 also addresses the question of whether clusters of these experimentally 
derived organic probes are more predictive of binding sites than PASS ASPs. 
Superposition of the ligands from nineteen elastase PDB complexes enables this 
comparison. All but three ligands bind in the S1 region of the known active site. The 
other three stick solely to an alternate site about 10A away (near S3'), while four 
molecules employ both sites. PASS identifies this alternate binding site via the fourth- 
ranked ASP (white); however, since only one organic probe lies in this region, this site 
cannot be identified solely on the basis of organic probe clusters. Conversely, there is a 
cluster of organic probes near the S4 binding pocket, but no ASP is placed there (this 
region is too close to the ASP in the S1 pocket). Thus, clusters of the organic probes of 
Ringe et al. and the ASPs of PASS appear comparably predictive of the known binding 
sites in elastase. It should be noted that the physical nature of the probes employed by 
PASS and by Ringe et al. are drastically different, so one should not expect identical 
distributions of binding hot-spots in the two cases. Ringe et al. probe the protein 
surface with small, often quite polar, molecules, precisely the opposite of PASS ASPs, 
which can be thought of as large and apolar. ASPs are effectively apolar in that they 
are identified solely on the basis of cavity size, shape and burial, with no regard for e.g. 
electrostatics and hydrogen bonding. Moreover, the PASS parameters have been 
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tuned such that only a cavity of a certain critical size can sustain an ASP. Over the set 
of systems in Table 3, the smallest regions of buried volume containing an ASP are 
approximately the size of a benzene ring, while ASP regions that bind a ligand are 
typically three- to ten-fold larger than that. It is gratifying, however, that the central 
binding site (S1) is unambiguously identified by both methods. 
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Figure 3 - C mparison to Crystallographically-Determined Organic Probes 




PASS was run in "more" mode using a cross-linked structure of elastase provided by Ringe and Mattos. 
The resulting ASPs are rendered as large spheres and colored according to probe weight, PW (see 
scale). Crystallographically determined organic probes (acetonitrile, dimethylformamide, acetone, 
ethanol, isopropanol, hexenediol) are displayed as solid yellow sticks. Although only one organic probe 
lies within 8A of the top- or second-ranked ASP, four of the five largest clusters of organic probes lie in a 
region identified as a potential binding site by PASS. Every E.C.3.4.21.36 elastase complex in the PDB 
(19 structures, 20 ligands: 1bma, 1btu, 1eai, leas, 1eat, 1eau, 1ela, 1elb, 1elc, 1eld, 1ele, 1elf, 1elg, 
1esb, 1fle, line, 1jim, 1nes, 9est) was superposed onto the cross-linked elastase structure, and the 
resulting ligand overlays are shown as orange, blue, and magenta sticks (except for two protein-bound 
structures, 1eai and 1fle). The inset shows a top view of the protein surface at the active site, with the 
portion of the surface defined by the catalytic triad colored green. The third-ranked ASP (pale blue) is 
centrally located in the active site (S1 region), while the fourth-ranked ASP (white) identifies an alternate 
binding site about 1 0A away (S3' region). Only 4 ligands (two of which are proteins) bind to both sites 
(colored blue). Thirteen of the twenty ligands (colored orange) bind in the S1 pocket but not in the 
alternate site. The other three ligands (1elf, 1elg, 1nes; colored magenta) bind only to the alternate site. 
Since only one organic probe lies in this region, probe clusters alone cannot identify this as a potential 
small molecule binding site. Conversely, a cluster of three organic probes lies in the S4 region, in a 
pocket that PASS failed to identify because it lies too close (i.e. < R ASP =8 A) to the S1 ASP. 
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Discussion 

PASS in a Virtual Screening Environment 

The hit rates shown in Table 4 indicate that PASS may serve as a front-end to 
virtual screening when the binding site is unknown or when alternative binding sites are 
sought. If the screening tool is fast enough that docking against multiple sites is 
permissible, then separate screening calculations can be run with the search space 
centered on the top few PASS ASPs. This strategy should enable identification of the 
optimal binding mode in most cases, as evidenced by the 71% hit rate to the top two 
ASPs in Table 4. A number of other screening strategies incorporating PASS are also 
possible. For instance, a more rigorous procedure could be used to select the "true" 
binding site from amongst the full set of ASP predictions. Using a docking routine with 
a more detailed scoring function, the affinity of a ligand for the different ASP regions 
can be directly compared. Thus, screening a small set of diverse probe molecules or 
fragments against all the ASPs might enable one to identify the stickiest region of the 
protein by comparing the scores of the top binders to each ASP region. A large 
database of ligands could then be computationally screened against this region. Since 
ASPs are determined using only steric size and shape, the electrostatic (ES) and 
hydrogen-bonding (HB) character of the ASP sites is arbitrary. One might, thus, search 
these sites for novel pharmacophores and construct focused combinatorial libraries 
designed to hit them. Conversely, one could use ES and HB characterization of ASP 
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regions to select sites most likely to possess affinity for a given class of compounds. 
Perhaps the most alluring aspect of PASS'S speed is that it (i) permits the expeditious 
analysis of entire structural databases (e.g. PDB, corporate), and (ii) could provide a 
suitable bridge between 3D structural modeling and ligand docking in a future drug 
design project designed to make use of genomic data. 

PASS as an Interactive Visualization Tool 

A PASS calculation on a moderate-sized protein (~ 30 kD) takes less than 
twenty seconds on a single Silicon Graphics R10000 processor (Table 3). PASS is, 
therefore, fast enough to be used interactively in a molecular modeling environment, 
and has particular utility as a visualization tool for drug design. By default, PASS 
produces PDB files of probes, ASPs, and ligand(s) (when specified), which can be 
loaded and rendered separately using any molecular modeling package. Alternately, a 
full display of the PASS output can be produced in a single step (in supported modeling 
suites) by executing a PASS visualization script, which loads, renders, and colors the 
protein, probe spheres, ASPs, and ligand(s). ASP coloring denotes probe weight (PW), 
while the probe spheres can be colored according to either (i) burial count (BC), (ii) 
group identity (optionally invoked via [-group]), or (iii) the layer of accretion in which 
each was identified. Color values (0-50) are encoded onto the B-factor column of the 
output PDB files containing the probes and ASPs. In runs for which the probes are 
smoothed and grouped, an integer specifying the group membership of each probe 
sphere is encoded onto the occupancy column of the probe PDB file. Figure 4 shows a 
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Figure 4 - PASS Visualization of RNAse A 




RNAse A (1rob) is shown in green and is rendered as a tube for clarity, while the 
cytidylic acid ligand is rendered in white sticks and is barely visible. The final probe 
spheres, which have been smoothed, are represented by small spheres and colored 
according to burial count. Active site points (ASPs) are rendered as larger spheres and 
colored by probe weight. The second-ranked ASP lies in the binding site. 
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standard PASS visualization in Insightll® for RNAse A (1rob), which is rendered as a 
tube for clarity. The probes are rendered as small spheres and colored according to 
B.C, while the two ASPs are rendered as larger spheres and colored by PW. The 
ligand, cytidylic acid, is shown in white and is mostly occluded by probes and the 
second-ranked ASP. Because the ligand binds to a long groove in the RNAse surface 
rather than a deep pocket, the ASP lying in the true binding site has a lower PW than 
the one shown at the right, which lies in a rounder cavity. 

One advantage of PASS as a visualization tool is that displaying the ASPs 
relative to the protein enables immediate identification of regions likely to be of interest 
in drug design. Since the ASPs are centrally located in cavities, one can use the 
displayed ASPs and a distance-based criterion to quickly identify the residues 
modulating binding in these regions. For the modeling suites that support subseting 
(e.g. Insightll), the PASS visualization scripts automatically define 6 A, 8A, and 10 A 
residue-based subsets around each ASP, which facilitate the coloring and specific 
display of these regions. Figure 5 shows the 8 A subset of protein residues around the 
top-ranked ASP of trypsin (3ptb). The ASP is shown in magenta, while the probe 
spheres are colored by burial count. The residues involved in benzamidinium binding 
are captured in this subset; e.g. hydrogen-bond partners are indicated by yellow lines. 
The probe coloring clearly indicates that the mouth of the binding pocket lies to the 
right, where the probe spheres have lowest burial counts. Because PASS ASPs are 
centrally located in cavities, 6-10 A radial subseting almost always enables selective 
visualization of all the residues defining a protein cavity. 
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Figure 5 - Residues Modulating the Binding of Benzamidinium to Trypsin 




The residues lining the binding pocket of trypsin (3ptb) are rendered as sticks 
and colored according to atom type. They were selected by defining an 8 A 
residue-based zone centered on the top-ranked PASS active site point, shown 
in magenta. The bound benzamidinium is shown in white, while the probe 
spheres near the pocket are rendered as small spheres and colored according 
to burial count (BC). The BC color scale runs from blue (high BC) to red (low 
BC), with muted colors denoting intermediate values. Dashed lines represent 
hydrogen bonds between benzamidinium and trypsin residues (D189 and 
G219), with distances measured in Angstroms. 
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By identifying multiple ASPs, PASS also suggests alternate binding sites in 
proteins for which a primary site(s) of binding has already been established. The 
pursuit of alternate binding sites is becoming increasingly prevalent in light of the 
mounting realization that many proteins have more than one biochemical role [21] and 
are likely to employ separate binding sites in performing distinct biochemical tasks. In 
addition, many enzymes have allosteric binding sites that effect catalytic activity or 
substrate binding via the induction of conformational changes upon cofactor binding 
[22]. PASS can suggest the locations of such sites. Finally, the disruption of protein- 
protein interactions forms the basis of many drug design efforts, and PASS can be used 
to identify interfacial pockets that may be suitable targets for drug binding. In particular, 
interfaces may be identified by using probe spheres to compute a difference map 
between the bound and unbound forms. This approach can be extended to quickly 
identify and visualize packing contacts in protein crystals or multimeric forms. 

PASS also facilitates the visualization of buried volumes in a protein in that the 
space occupied by the manifold of probe spheres represents this volume, which can be 
viewed and manipulated as a solid object by rendering the probes in a space-filling 
model. Mesh or solid representations of various surfaces (molecular, van der Waals, 
Connolly) are often used to visualize the shape complementarity of a protein surface for 
putative ligands or functional groups. Often these surfaces are colored according to 
some other receptor-based property, such as electrostatics, hydrogen bond propensity, 
or surface curvature. The idea is that a modeler can use this sort of display to look for 
likely ligand hot-spots on the protein by visually searching the surface for voluminous 
invaginations that are colored to indicate favorable complementarity in, say, 
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electrostatic potential. In reality, ligands only bind to regions possessing enough buried 
volume to significantly accomodate them. Hence, buried volume is a quantity of central 
importance in drug design, and the development of methods for informatively displaying 
such regions should be accorded due attention. Surface representations fail to capture 
buried volumes directly in that the user is left to infer the buried volume from void 
space, much of which is obscured from view by the surface. Likewise, colored surface 
quantities are of most interest near deep invaginations, precisely where the surface is 
most difficult to see. Unfortunately, user expertise is typically required to overcome 
such difficulties. PASS takes a more direct approach by filling the buried volumes with 
a set of unbonded atoms that represent the ASPs and probe spheres. This enables 
both the size and shape of the buried volumes to be viewed directly, either with or 
without the protein, using any molecular visualization tool. Rendering the buried 
volumes as solid allows the user to eyeball the fit of certain ligands and groups to 
potential hot-spot regions. Figure 6 shows the region of buried volume (orange) lying in 
the binding cavity of retinol binding protein (1rbp), along with the bound retinol (white), 
some surrounding residues, and the top- and third-ranked ASPs (in magenta), on the 
left and right, respectively. Information equivalent to what is color-coded onto protein 
surface displays can, in principle, be captured by property-based coloring of probe 
spheres. For instance, the user could perform a finite-difference Poisson-Boltzmann 
calculation and color the probe spheres according to electrostatic potential, <)> 9S . Directly 
displaying <|) es in the region of interest, rather than having to infer it from <)) es at the protein 
surface, provides a more meaningful view of electrostatics than a surface 
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representation. Favorable hydrogen-bond donor and acceptor positions can likewise be 
more meaningfully defined within the manifold of probe spheres than on a protein 
surface. Interaction-based coloring schemes are not presently automated within PASS, 
however. 

Comparative Study 

Many procedures for characterizing and visualizing protein cavities have been 
presented in the past and, while all differ substantially from PASS, comparative study 
serves to highlight some of PASS'S strengths and weaknesses. First, almost all prior 
methods identify cavity regions using some type of regular grid [2, 5, 6, 8-1 1, 23-26]. A 
grid simply provides the coordinates of points lying in cavities, which are then used in 
some fashion to identify boundaries with the protein and, for all but internal voids, with 
empty space. One disadvantage of using a grid is that its storage consumes memory 
unnecessarily. Likewise, uncertainties arise relating to the possible dependence of 
results upon grid spacing or positioning. Orientational dependence was indeed found 
in the program POCKET [9, 24]. The advantage of implementing a grid is purely 
algorithmic, as there is no physical reason to use regular geometry when it is well 
known that protein packing and protein surfaces are extremely irregular [27], if not 
fractal [28]. The PASS algorithm captures this irregularity by using geometry to project 
outward from the known atomic coordinates in order to inscribe cavity regions. 
Although this sort of protein-based approach has been taken by other groups [7, 8, 29, 
30], the geometry employed in these studies differ significantly from PASS. Every point 
in a protein cavity may be thought to represent a sphere that lies exactly tangential to 

the protein surface. The radius of this sphere is the distance of closest approach, and 
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the sphere generally touches the protein at one, two, or three points (i.e. atoms). 
Several authors have used this correspondence (in reverse) to define points lying in 
cavity regions by specifying a set of probe spheres and using geometry (one-, two-, 
and/or three-point) to project outward from the protein atoms into the cavity region. For 
instance, cavity points have been obtained by placing tangential spheres midway 
between atoms [8, 30] and by rolling a probe sphere over the set of atomic spheres 
representing the protein [7, 10]. The resulting probe coordinates usually correspond to 
one or two points of tangency with the protein. However, the sterically optimal packing 
of a spherical probe against the protein has the probe lying tangent to exactly three 
atoms, just as a marble that is dropped onto a pile of other marbles will come to rest 
touching exactly three. Unlike any previous method, PASS uses only three-point 
geometry to obtain points lying in cavity regions. Consequently, the shape of the 
rendered manifold of PASS probes represents maximally favorable sterics. One might 
expect that positioning the probe spheres using only three-point geometry would give 
rise to a spotty distribution of probes and poorly-shaped buried volume. Practical 
experience has shown, however, that PASS produces smooth well-shaped buried 
volume manifolds (e.g. Figure 6), and that using only three-point geometry helps 
minimize the number of points required to fill protein cavities. 

The most ambiguous aspect of cavity characterization lies in deciding where to 
place the boundary between the pocket and free space; i.e. determining "sea-level" [8]. 
Several studies appearing in the literature [5, 6, 10] operate by filling fully-enclosed 
volumes (e.g. "flood fill") and, thus, require an artificial means of closing-off the mouths 
of cavities in order to define sea-level. With many other methods [8, 9, 23, 24], the 
definition of sea-level arises as a biproduct of the algorithm itself and has no physical 

significance. The work of Kuntz et al. [7] is closest in spirit to the present study with 
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regard to sea-level definition. Their method uses the Connolly surface as a substrate 
for sphere growth and rejects spheres based upon two criteria: (1) an angular 
condition, which essentially selects concave regions over flat or convex ones, and (2) a 
5A upper bound on radial sphere growth. Their radial constraint is expected to 
generate sea-level boundaries similar to those found with PASS. Unlike any other 
method of cavity detection, however, PASS explicitly defines sea-level according to a 
quantity of known physical significance, solvent accessibility, as quantified by burial 
counts (BC). 

Computational speed and ease of use are also important criteria for comparison 
and, in these categories, PASS rates favorably with all published methods. Although 
reliable speed comparison is difficult since few studies report CPU times [2, 8, 10, 26, 
30] and others report times on old processors [5, 7, 11, 29], the fastest CPU times 
reported in the literature belong to the LIGSITE program of Hendlich et al. [24], which 
can analyze a moderate-sized protein (at 0.5 A grid spacing) in about 15 seconds. This 
is approximately the same speed demonstrated by PASS; however, the LIGSITE CPU 
time ramps-up very steeply as the grid spacing is reduced (twelve-fold slower at 0.25 
A), and the authors provide only a cursory investigation of the dependence of their 
results upon grid scale. PASS also excels in useability in that it requires no startup cost 
to use because the inputs are simple and the outputs are standard. A few programs in 
the literature appear to have shared this design perspective [8, 23, 24, 29]. The input to 
PASS is restricted to a PDB file(s) specifying the protein(s) coordinates plus a few 
optional command-line flags that can be used to control more detailed behavior. PASS 
produces versatile output in the form of standard PDB files, which allows the user to 
immediately view the results using whatever modeling tool is already familiar. 
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Physical Und rpinnings 

Although the roots of the PASS algorithm are geometrical, not statistical 
mechanical, it is useful in light of PASS'S success in identifying known binding sites to 
examine a posteriori which physical interactions (if any) are mimicked in PASS. PASS 
takes the philosophy that the task of binding site prediction is to identify regions of 
space along the protein where an arbitrary ligand might tightly bind. A physically well- 
designed algorithm should incorporate as many contributions to binding affinity as 
possible without sacrificing applicability over a wide range of ligands. Binding affinity is 
dictated by the free energy change induced by the binding process, AG bin() , which is 
known to have numerous contributions, both enthalpic and entropic. While there is 
disagreement regarding some factors [31-33], sterics, electrostatics, hydrogen-bonding, 
and solvation are known to be major players [34-38]. Of course, the fine details of 
ligand size, shape, flexibility, hydrogen-bonding propensity, and polar character are 
crucial determinants of AG bind ; however, the observation that proteins usually bind 
ligands strongly at only a few sites suggests that one might be able to use coarse 
details of ligand character (e.g. size) to identify these few binding sites. Thus, PASS 
must make its predictions using only binding affinity contributions that depend upon 
coarse ligand character. Two important contributions to AG bjnd fit this description: 
solvation and sterics. Ligand binding is always favored entropically by the desolvation 
of molecular moieties, regardless of polarity [39]. This is because the hydration of any 
atomic group causes net ordering in the first few solvation shells of surrounding water. 
The PASS algorithm mimics this desolvation effect via the rejection of probe spheres 
based upon burial count. Likewise, the formation of steric (i.e. enthalpic van der Waals) 
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contacts between ligand and protein is generally favorable, regardless of the ligand. 
Although the steric contribution to AG bin(j depends upon detailed molecular shape, the 

hardness of the steric interaction precludes any ligand from binding tightly to the protein 
without adopting a configuration consistent with the size and shape of the buried 
vblume. PASS includes sterics by imposing an implicit size and shape criterion upon 
which regions of buried volume can be identified as active site points (ASPs). In 
particular, a region of buried volume that is either too small or too narrow to contain 
even a small ligand without steric clash will never contain an ASP because too few 
probe spheres will lie in the region for any one to have a large enough probe weight to 
be selected as an ASP. The PASS parameters (esp. R 0 and PW min ) have been 
empirically tuned to make this distinction reliably. 

Similar arguments cannot be made regarding the electrostatic interaction, for 
instance, which may contribute either attractively or repulsively to AG^, depending 
upon ligand charge and polarity. Several other programs in the literature, however, 
implement energetics in an effort to use other factors (e.g. hydrophobicity, 
electrostatics) to help identify and rank potential binding site cavities [2, 5, 26]. Most 
notably, Ruppert et al. present the most impressive results in the literature with regard 
to accuracy in locating binding sites [2]. Their method uses an in-house empirical 
forcefield to dock three different types of probes (steric, H-bond donor, H-bond 
acceptor) against the protein binding site. This maps out a set of favorable "probe" 
positions and permits the identification of "sticky spots" on the protein, which are used 
as central points to carve-out individual pockets. Although they provide no CPU times, 
their algorithm requires significant docking and, thus, is probably considerably slower 
than PASS or LIGSITE. They apply this method to the prediction of binding sites in a 
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set of 1 1 PDB complexes and find that their top-ranked pocket contains the ligand in 
every case. Nine of these eleven cases, however, are included in the PASS test set 
(Table 3), and strikingly similar results are obtained with PASS. The top-ranked ASP is 
a binding site hit in eight of the nine overlapping trials, and the second ASP is a hit in 
the other case. Although factors such as electrostatics and hydrogen-bonding certainly 
contribute to the affinity of a ligand for a particular cavity, the perspective taken in PASS 
is that only the most ligand-independent contributions to binding (i.e. size, shape, and 
burial extent of cavities) should contribute to binding site prediction. Energetic factors 
that strongly modulate specificity should be addressed case-by-case, either manually 
by the user or via downstream software (e.g. docking). Thus, the PASS ASP regions 
are completely inclusive with regard to electrostatic and hydrogen-bonding character, 
with the intention that each will be reinvestigated individually in light of a particular 
application or desired complementarity. PASS'S success in predicting binding sites 
without electrostatics and hydrogen-bonding constitutes a remarkable restatement of 
the importance of solvation and sterics in binding. 

Conclusions 

PASS is a simple cavity detection tool that has utility in both virtual screening 
and interactive molecular modeling environments. PASS was shown to reliably predict 
the locations of known binding sites using a set of 20 apo-protein x-ray structures from 
the PDB, thereby establishing its utility as a front-end to fast docking and virtual 
screening. Furthermore, for the price of a thirty-second investment, PASS provides the 
user a meaningful view of the buried volumes in a protein, suggests alternate binding 



37 



sites, and simplifies detailed visualization of potential binding hot-spots. PASS is freely 
available in unix executable form (SGI Irix, SunOS, Linux) to all users via the Protein 
Data Bank web site under "PDB-related Software" (http://www.pdb.bnl.gov/pdb- 
docs/software.html). 
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App ndix A - Three-Point Sph r G ometry 

The sphere placement algorithm in PASS hinges upon solution of the following 
geometry problem. Given three "base" spheres (i, j, and k) of known positions (R,, R,, 
RJ and radii (o„ and oj, at what two positions (R p ) can a "probe" sphere of radius a p 
be placed so as to be exactly tangential to all three base spheres? We seek the 
general solution, in which none of the radii are necessarily equal and the coordinates of 
the base spheres are unconstrained. Figure A1 illustrates the situation: sphere 
perimeters are outlined, base sphere centers are labelled "i", "j", "k", the "base plane" (i- 
j-k) is shaded, the probe sphere is shaded and labelled "p", and vectors are denoted 
with uppercase lettering while points and distances are in lowercase. The global origin 
coordinates is labelled "O", while a local frame is defined by unit vectors fx', y', z'}. 
There are, in general, two solutions for R p , one on either side of the base plane. 
However, one must first impose several conditions to ensure the existence of a 
solution. If any pair {i,j} of base spheres are too far apart, the probe will be unable to 
bridge the gap, so one must first ensure that |Rj -R l |<a,.+(T y .+2cr p , and likewise for 

pairs {i,k} and {j,k}. One must also make sure that no base sphere lies entirely within 
the volume occupied by the other two. With these conditions satisfied, the coordinates 
R p of the two valid probe sphere positions may be written 

R p =R b ±/zz\ (A.1) 
where h is the height of the probe above the base plane, and z' is a unit normal to this 
plane. To be precise, the local coordinate frame fx', y', z'} is right-handed, with x' 
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lying along R,-R, and z' pointing out of the base plane in the direction of x' x (R^R,). 
The right triangle i-b-p gives the height 



h^fa+aJ-frt-Rtf ■ (A.2) 

The vector R b from O to the point of projection of the probe onto the base plane, b, can 
be written vectorially as 

R b =R,+(T u -R,>-U, (A.3) 

which leaves T„ and U undetermined. In general, point b need not lie on the interior of 
triangle i-j-k, as drawn, but the equations are the same in either case. U can be 
eliminated from Eqn. (A.3) by observing that 

UC^-R^V-C^-R,), (A.4) 

where V = T lk - T,,, and U points in the direction of y'. Solving Eqn. (A.4) for U yields 

u= (T, k -T, J )(T -R.) 

The remaining vectors (T v T ik , T Jk }, which run from O to points {t,,, t ik , y, are found by 
considering the triangles formed by two base spheres and the probe sphere. For 
instance, the triangle i-j-p comprises two right triangles, i-t^-p and j-t^-p. Applying the 
Pythagorean theorem to each enables determination of the distance from i to t,, via a 
quadratic equation, which yields the desired vector 



1= x( Ri+R > fe- : >-^;^) p ( Rj - Ri .), 



T„ =i (R, + R, k— '.' x ' (Rj - R, j- (A.6) 

2 | R J- R .| 

Swapping indices in Eqn. A.6 gives analogous equations for T fc and T ik . The normal 
vector, n, to the plane of tangency (a-p-y) may also be of interest: 

"=C • x r j +K r j x r k + a+ jp r k x r . + (^ r - + <*l*j + a n r k ) x r J, (A. 7) 
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where <x* = a b ±a a , C = o^ala]^ , and n is not of unit magnitude. 



Figure A1 - Sphere Geometry 




Q 
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Abstract-To streamline the preclinical phase of pharmaceutical development, we have explored the utility of 
structural data on 

the molecular target and synergy between computational and medicinal chemistry. We have concentrated on 
parasitic infectious 

diseases with a particular emphasis on the development of specific noncovalent inhibitors of proteases that play a 
key role in the 

parasites' life cycles. Frequently, the structure of the enzyme target of pharmaceutical interest is not available. In 
this setting we 

have modeled the structure of the relevant enzyme by virtue of its sequence similarity with proteins of known 
structure. For 

example, we have constructed a homology-based model of falcipain, the trophozoite cysteine protease, and used 
the computa- 
tional ligand identification algorithm DOCK to identify in compuo enzyme inhibitors including oxalic bis(2- 
hydroxy-l-naphthyl- 

methylene)hydrazide (1) [Ring, C. S.; Sun, E.; McKerow, J. H.; Lee, G.; Rosenthal, P. J., Kuntz, I. D.; Cohen, F. 
E., Proc. Natl 

Acad. Sci. U.S.A. 1993, 90, 3583]. Compound 1 inhibits falcipain (IC,~, 6 ~M) and the organism in vitro as 
judged by hypoxanthine 

uptake (ICs~, 7 ~tM). Following this lead, to date, we have identified potent bis arylacylhydrazides (ICs. 150 
nM) and chalcones 

(IC~o 200 nM) that are active against both chloroquine-sensitive and chloroquine-resistant strains of malaria. In 
a second 

example, cruzain, the crystallographically determined structure of a papain-like cysteine protease, resolved to 
2.35 A., was avail- 
able. Aided by DOCK, we have identified a family of bis-arylacylhydrazides that are potent inhibitors of cruzain 
(ICso 600 ~tM). 

These compounds represent useful leads for pharmaceutical development over strict enzyme inhibition criteria in 
a structure- 
based design program. Copyright © 1996 Elsevier Science Ltd 
Introduction 

More and more successful examples of the design of 
new ligands based on knowledge of target protein 
structure have been reported. These include design of 
noncovalent antiparasitic agents,' 3 rational design of 
sialidase-based inhibitors of influenza virus replication, 4 
design of nonpeptide cyclic ureas as HIV protease 
inhibitors, ~ structure-based discovery of inhibitors of 
thymidylate synthase, ~ and structure-based design of 
inhibitors of purine nucleoside phosphorylase. 7 10 Most 
structure-based drug design relies on X-ray crystallo- 



graphy/NMR spectroscopy" to obtain the appropriate 
structures to identify new leads and guide lead optimi- 
zation. Our work on the structure-based drug design of 
parasitic protease inhibitors has relied upon X-ray 
crystal structures when available (e.g., cruzain nl3 for 
anti-Chagas disease agents), but has relied on a 
homology-based model structure I ' if neither X-ray nor 
NMR data are available (e.g., falcipain for antimalar- 
.ials) to generate leads and guide our lead optimization. 
Proteases are involved in many important biological 
processes including protein turnover, blood coagula- 
tion, complement activation, 1~ hormone processing, ~ 
*Present address: Ar Qule Pharmaceuticals, Inc., 200 Boston Avenue 
Medford. MA [)2 155. U.S.A. 
and cancer cell invasion, 1" Thus, they are frequently 
chosen as targets for drug design and discovery. A 
potential strategy for the treatment of diseases caused 
by parasites is the design of compounds which select- 
ively inhibit enzymes that are pivotal for survival of the 
parasite within the host that are part of biochemical 
pathways that are specific to the parasite. Parasite 
proteases are attractive target enzymes because of their 
roles in replication, metabolism, survival and 
pathology.~7 

In the most simple terms, structure-based drug design 
methods identify favorable and unfavorable inter- 
actions between a potential inhibitor and target 
receptor and maximize the beneficial interactions to 
increase binding affinity. Although X-ray crystallo- 
graphy continues to be the source of high-resolution 
information about protein structures, considerable 
delays often exist between determining the sequence of 
a protein and solving its structure. Difficulties in 
protein expression and more commonly in protein 
crystallization are often responsible for such delays. 
Currently, no general method exists to predict tertiary 
structure from amino acid sequences. However, when a 
protein target is relatively highly homologous to 
another protein or group of proteins of known struc- 
ture, a sensible model structure can be proposed. ~s 
1421 
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The World Health Organization estimates that 280 
million people are infected with malaria ~9 and 1-2 
million deaths are reported annually. 2° While various 
classes of antimalarial agents are available, chloroquine 
and its derivatives remain the mainstay of therapy 
against malaria. Unfortunately, the emergence of 
malarial parasite strains resistant to chloroquine has 
eroded its efficacyfl I This increases the urgency of the 
search for novel and cost-effective agents to treat 
chloroquine-resistant malaria. 
Chagas disease is caused by Trypanosoma cruzi, a 
protozoan parasite that afflicts more than 24 million 
people in South and Central America. It is the leading 
cause of heart failure in many Latin American 
countries. Currently, there is no satisfactory treatment 
for this parasitic infection. Cruzain, the major cysteine 



protease present in T. cruzi, is pivotal for the develop- 
ment and survival of the parasite within the host cells. 
This makes the enzyme a target for potential trypano- 
cidal drugs. 

In our previous reports, 1-3 we described a structure- 
based approach to inhibitor design for antimalarial 
drug development using models of falcipain, a malaria 
^ophozoite cysteine protease structure. Here, we 
summarize our progress to date in the structure-based 
drug design of parasitic protease inhibitors. 
Results and Discussion 
Lead discovery and optimization 
A structure for the malaria cysteine protease, falcipain, 
was proposed using the X-ray structures of papain and 
actinidin, two cysteine proteases from plant sources, as 
a basis for homology modeling? Falcipain has 33% 
sequence identity with both papain and actinidin. 
Moreover, -60% of the conserved sequence centers 
around the active site regions. The homology-based 
model of the enzyme provides the template and the 
DOCK 22 algorithm calculates a set of spheres with 
approximately atom sized radii to fill the active site 
cleft. Within DOCK, the quality of a given compound's 
fit into the binding cleft can be evaluated based on its 
shape complementarity (contact score) or molecular 
mechanics interaction energy (AMBER force-field 

score). 

The model structure was then used as a template for a 
DOCK search of the Fine Chemicals Directory of 
commercially available small molecules for putative 
ligands (the Fine Chemicals Directory distributed by 
Molecular Design Limited Information System, San 
Leandro, California, is currently known as the Avail- 
able Chemical Directory). When searching a database 
of compounds, DOCK examines only the "best' orienta- 
tion of the small molecule within the binding cleft 
(DOCK database screening mode). When a. single 
compound is studied, multiple possible binding modes 
can also be examined (DOCK single mode). Of course, 
the initial orientation of the compound is dictated in 
part by the irregular lattice of sphere centers identified 
originally. To overcome some of the scoring distortion 
that this bias could impart, a rigid body minimization 
algorithm has been developed to move the ligand 
within the binding cleft and optimize the shape or 
forcefield scores. 23 Compound 1 [oxalic bis((2-hydroxy- 
l-naphthylmethylene)hydrazide)] was selected based on 
its score for shape complementarity. Thirty-one 
compounds were finally tested and a lead compound 1 
was identified as the best inhibitor of the protease. The 
IC5o value for enzyme inhibition against the substrate 
benzyloxycarbonyl-Phe-Arg-(7-arnino-4-methylcouma- 
rin) was 6 i.tM. 3 More importantly, this compound 
inhibits the growth of parasites in culture. Malaria 
lacks some of the enzymes required for de novo purine 
biosynthesis and thus depends on purine salvage 



pathways for DNA replication. Compound 1 inhibits 
parasite growth as judged by its ability to block hypox- 
anthine uptake, with an apparent IC5o value of 7 pM.3 
Compound 1 fits a model of the active site of the 
malarial cysteine protease as shown in Figure 1.1-3 The 
DOCK 22 program placed this lead compound 1 into the 
enzyme's active site, presumably filling three of the 
.substrate side-chain specificity pockets (subsites $2, $1' 
and to a lesser extent Sa). Beginning with compound 1 
as shown in Scheme 1, the following chemical modifi- 
cations were made in an attempt to identify more 
active agents: (i) The length of the backbone linking 
the aromatic rings of 1 was shortened via the construc- 
tion of asymmetric acylhydrazides, which could have 
less conformational heterogeneity than the symmetric 
hydrazides, yet still could fill at least two of the three 
putative subsites (5 in Table 1). Compounds can be 
constructed by attaching a third aryl group to the 
center aromatic moiety to fill all three putative subsites 
(6 in Table 1). (ii) Heterocyclic acylhydrazides were 
generated by incorporating nitrogen atoms into 
aromatic rings both to improve water solubility of the 
compounds and potentially to enhance electrostatic 
interactions with His67 in the $2 site, (iii) To increase 
the chemical/metabolic stability of the compounds, a 
four-atom hydrazide linker was replaced with a three- 
atom ~,13-unsaturated ketone bridge (7 and 8 in Table 
1). (iv) Naphthalene, quinoline or isoquinolinc rings 
were exchanged for substituted phenyl rings on both 
acylhydrazide and a, 13 -unsaturated ketone linkers to 
explore the effective size and electronic character of 
the putative subsite specificity pocket. 
Chemistry, antiparasitic activity and inhibition 
specificity 

Since cost of production is a critically important 
consideration if the resulting antimalarials and other 
antiparasitic agents are ever to be developed into 
therapeutic agents for the world's developing countries, 
one of our guidelines for the development of antipar- 
asitic agents is that these compounds should be 
inexpensive to produce. Hence, we developed relatively 
simple chemistry to prepare both arylacylhydrazide and 
chalcone derivatives. For both series, the final step is 
the condensation of an aldehyde with either acylhydra- 
zines via imine formation 2 or substituted methylketones 
Parasitic protease inhibitors 1423 

via a Claisen- Schmidt condensation. 1 Since there are a 
variety of commercially available aldehydes and 
methylketones as starting materials, a large number of 
target compounds can be produced relatively inexpen- 
sively. For preparation of the key acylhydrazine inter- 
mediates, a published procedure was followed) 
After some 400 bis arylacylhydrazide and chalcone 
derivatives were synthesized, they were screened 
against three different antiparasitic screening systems. 
Antimalarial activities were evaluated based on an 



assay of parasitemia of red blood cells quantitated by a 
fluorescence-activated cell sorter (FACS) analysis 24 and 
the more classical assay of metabolic viability, hypox- 
anthine uptake, 1 described below. 
For the FACS analysis, synchronized trophozoite-stage 
parasites were cultured in human blood at various 
inhibitor concentrations. The parasites were allowed to 
mature, the host cell was lysed and parasite invasion of 

'Figure 1 . A putative binding orientation of the lead' compound (1) bound to the active site of the malarial cysteine protease. Key 
residues and 

the binding subsites of the protease are colored as cyan (S; site), yellow (S/catalytic site), and purple (SJS~ site). For the lead 
compound, carbon 

is shown in green, oxygen in red, and nitrogen in blue. 
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flesh red blood cells was investigated. Using propidium 
iodide to stain DNA, the FACS can discriminate 
between infected and uninfected cells and between 
stages of intraerythrocytic parasite development as only 
infected red blood cells contain DNA. 24 The hypoxan- 
thine uptake system assessed the intrinsic antimalarial 
activity in vitro against the erythrocytic asexual life 
cycle (blood schizontocides). Two Plasmodium falci- 
parum clones, CDC/Indochina III (W2) and 
CDC/Sierra Leone I (D6), 2~ were used for all anti- 
malarial assays. W2 is resistant to chloroquine, quinine, 
and pyrimethamine and susceptible to mefloquine. D6 
is resistant to mefloquine and susceptible to chloro- 
quine, quinine and pyrimethamine. The resistance 
indexes are defined as ratios of the IC5, of a compound 
against W2 to the ICS. of the same compound against 
D6. This index is used as a factor to evaluate whether 
or not novel antimalarials are potential agents against 
chloroquine-resistant parasites. Chloroquine and 
mefloquine were used as controls in the assays. The 
third screening system involved the assay of two . 
cysteine proteases, cruzain and cathepsin B. ~ Enzymatic 
activity of these two proteases was measured by 
following the cleavage of fluorogenic substrates. 
Structure-activity relationships (SAR) for some bis 
arylacylhydrazide and chalcone derivatives were 
reported previously, t2 Here, we summarize antiparas- 
itic activity and enzyme inhibition of a group of hydra- 
zide and chalcone derivatives. As shown in Table 1 , the 
lead compound 1 showed little specificity in both 
antimalarial activity and enzyme inhibition. Cathepsin 
B is a mammalian cysteine protease from bovine spleen 
with 87% sequence identity to a human cathepsin B 
and papain is a cysteine protease from papaya. For 
comparison, inhibition activities of both cathepsin B 
and papain by hydrazide and chalcone derivatives are 
included in Table 1 . Compound 2 is a more specific 
inhibitor of cruzain, with an IC~, value of 600 nM, than 
the rest of the compounds shown in Table 1 . 
Compound 5 is the best in vitro antimalarial found in 
the acylhydrazide series with an IC~. value of 150 riM. 
The most noteworthy example is compound 6. The ICsO 
value as an antimalarial is 450 nM while the IC~, value 
as a cathepsin B inhibitor is at least 400-fold higher 
(Table 1). This suggests that compound 6 is a more 



specific inhibitor of the malarial cysteine protease and 
will be less likely to be toxic to the host. Compound 7 
is the best in vitro antimalarial compound found in the 
chalcone series to date with an IC~, value of ~ 200 nM. 
The differences in inhibitory specificity shown above 
may stem from the fact that the active site residues of 
these cysteine proteases are quite different, as shown 
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in Table 2, especially the residues from the SiS3 site 

and part of the Sdcatalytic site. The most likely binding 

orientations of inhibitors involve extensive interactions 

between inhibitor molecules and the three binding 

subsites. The highly conserved SI' and catalytic binding 



subsites provide basic interactions for binding, while 
the diverse residue types in the SiS3 binding subsite 
among different proteases result in binding specificity. 
A thorough discussion of structural implications of 
binding specificity will be.presented in a separate 
manuscript. '~ 
Conclusions 

Structure-based drug design typically depends upon the 
•experimental determination of the target structure by 
either X-ray crystallography or NMR spectroscopy. We 
have circumvented this step and relied exclusively on 
the amino acid sequence homology between the 
Table 1. Antiparasitic activity and enzyme inhibition of some hydra- 
zide and chalcone derivatives 

Compds Structure Antimalarial Cruzain b Cathepsin Papain ~' 
activity B" 

1~.~@7 '0(1)22(1) 19(1) 

% :',-"'-.„-.- 91 (20) 78 (20) 74 (2(1) 
2 ( ~ ' ~ " 0.6 ~ 20' 50' 

4 > 10' 30' 20' 
*-~NDNOND 

6 ~o 2.() ~ 200' ND 

7 ND ND ND 

>1& 
ND ~ 
ND 

0.1S(J~'~2 
(i.450~,. r..~ 

o.:30 

~~0.19@ ,.„ 

8 i 0.. 10.... 0(1)3(1 
0.720 ~. ~ 

6(1) 

"FACS assay. IC~, in j.iM. 

~'% Enzyme inhibition 0. 1 M inhibitor). 

~ND: not determined. 

aHypoxanthine uptake assay, IC~„ of W2 (chloroqulne-resistance 
strain) in laM. 

"Hypoxanthine uptake assay, IC~„ of D6 (chh)roquine-sensitive strain) 
in [aM. 
'IC~. In I—M. 

malaria enzyme and other cysteine proteases of known 
structure to support our antimalarial program. A 
homology based model of the malaria enzyme served 
as the template for a computer-based ligand docking 
calculation that identified a useful lead compound for a 
group of cysteine proteases in our antiparasitic 
program. Lead optimization was achieved by a 
combined approach of computational and synthetic 
analysis. Derivatives of the lead were first optimized 
for fit using the computer docking program, and then 
numerous candidate compounds were synthesized and 
tested experimentally. Despite the lack of a detailed 
experimental structure of the target enzyme or the 
enzyme-inhibitor complex, we have been able to 
identify compounds with increased potency. To date, 
we have identified a potent bis arylacylhydrazide, 5 
(IC50 150 nM), and a chalcone, 7 (IC-0 -200 nM), that 
are active against both chloroquine-sensitive and chlor- 



oquine-resistant strains of malaria. Due to the -30% 
sequence identity between cruzain and falcipain, a 
subgroup of compounds generated in the malaria 
project (selected using DOCK as a guide) were also 
screened against cruzain. The best cruzain inhibitor so 
far, hydrazidc 2, had an ICs„ value of 600 nM. 
Experimental 

Melting points were measured on a Thomas-Hoover 
'Unimelt apparatus and are uncorrected. TLC (silica gel 
60 GF254, Merck, Darmstadt) was used to monitor 
reactions and check product homogeneity. NMR 
spectra at 300 MHz for ~H and 75 MHz for ~3C (tetra- 
methylsilane as internal standard) were recorded on a 
General Electric QE-300 spectrometer. Chemical 
ionization mass spectrometry (CIMS) spectra were 
obtained at the UCSF Mass Spectrometry Facility, A. 
L. Burlingame, Director. Elemental analyses were 
performed by the University of California, Berkeley, 
Microanalytical Laboratory and were within _+0.4% of 
the theoretical values. All starting materials were 
purchased from Aldrich Chemical Company, Inc. 
General procedure for condensation of aldehyde with 
hydrazine (Method A for 2 - 6 ) 
To a solution of the aldehyde (1 mmol) in methanol 
(20 mE) was added the corresponding acylhydrazine (1 
mmol). The resulting mixture was heated at reflux at 
65 °C for 3 h. In most cases, a precipitate was observed 
after 10 min. The precipitate was filtered, washed with 
Table 2. Active site residues of cysteine proteases 
SI S~/catalytic site 82/J 

Proteases 19 177 175 25 159 67 68 69 133 160 205 207 
Papain Gin Trp Ash Cys His Tyr Pro Trp Val Ala Set Phe 

Actinidin — ' - lie Thr Ala - - Met Ser 

Falcipain His - - Phe Asn Ser Glu - - 

Cruzain Leu Met Asn Ala Gly Glu Ser 

Cathepsin B Glu - - Ser Ala - - Glu Val 

"Same residue wpe as papain.. 
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hot methanol (50 mL) and dried under vacuum to give 
a solid (usually yellow). If needed, additional purifica- 
tion was performed by recrystallization using appro- 
priate solvents. 

Preparation of acylhydrazines from the corresponding 
acid or ester (Method Al) 

A mixture of the acid (10 mmol) and coned H2SO 4 (5 
mL) in MeOH (25 mL) was heated at reflux for 12-24 
h. The mixture was poured onto ice (100 mL). The 
resulting precipitates were collected, washed with HzO 
and recrystallized from EtOH:H20 to give the pure 
methyl ester. The methyl ester was then dissolved in 
EtOH (80 mL) and treated with hydrazine monohy- 
drate (5.01 g, 100.0 mmol). The resulting mixture was 
stirred overnight at 20°C and then concentrated to 
give the corresponding acylhydrazine, which was 
further purified by recrystallization from EtOH/H20. 
Compound 1 [oxalic bis(2-hydroxy-l-naphthylmethyle- 
ne)hydrazide] was reported previously. 2 Chalcone 



derivatives 7 [l-(2,5-dichlorophenyl)-3-(4-quinolinyl)- 

2- propen-l-one] and 8 [l-(3,4-dimethoxyphenyl)- 

3- (3-(2-chloroquinolinyl))-2-propen- 1-one] were 
prepared via a Claisen-Schmidt condensation also 
described previously) 

2'-Hydroxy-6 , -methoxynaphthylacyl-(2.hydroxy-l.naph- 

thylmethylene)hydrazide (2). A 70% yield; mp 

•>280 °C; ~H NMR (DMSO-d6): 8 12.84 (s, 1 H), 12.20 

(br s, 1 H), 1 1.20 (br s, 1 H), 9.59 (s, 1 H), 8.46 (s, 1 

H), 8.36 (d, 1 H, J= 8.4 Hz), 7.96 (d, 1 H, J= 9.0 Hz), 

7.91 (d, 1 H, J=8.1 Hz), 7.74 (d, 1 H, J=9.0 Hz), 7.63 

(t, 1 H, J=. 6 Hz), 7.43 (t, 1 H, J=7.5 Hz), 7.28 (d, 1 

H, J=9.0 Hz), 7.22 (dd, 1 H, J=2.0, 9.1 Hz), 3.90 (s, 3 

H); ~3C NMR (DMSO-d6): 8 163.08, 158.14, 155.73, 

152.16, 147.48, 132.89, 131.71, 131.48, 129.25, 128.93, 

127.81, 127.76, 127.72, 127.44, 123.56, 121.31, 121.00, 

1 19.94, 1 18.21, 1 10.94, 108.65, 106.45, 55.17; CIMS: m/z 

(MH -) 387.3. Anal. (C23H18N204): ealcd: C, 71.49; H, 

4.70; N, 7.25. Found: C, 71.09; H, 4.84; N, 7.23%. 

2'-Hydroxynaphthylacyl. (2, 8-dihydroxy- 1-naphthylmethy- 

lene)hydrazide (3). A 88% yield; mp 274 °C (dec); IH 

NMR (DMSO-d6): 8 14.03 (s, 1 H), 12.54 (br s, 1 H), 

1 1.31 (br s, 1 H), 10.47 (s, 1 H), 10.37 (s, 1 H), 8.53 (s, 

1 H), 7.96 (d, 1 H, J=8.1 Hz), 7.87 (d, 1 H, J =9.0 

Hz), 7.80 (d, 1 H, J= 8.4 Hz), 7.54 (t, 1 H, J = 7.5 Hz), 

7.39 (m, 3 H), 7.22 (m, 2 H), 7.08 (d, 1 H, J = 7.2 Hz); 

13C NMR (DMSO-d6): 8 163.80, 159.39, 154.34, 153.30, 

153.02, 135.93, 133.51, 130.26, 130.06, 128.66, 128.28, 
126.73, 125.91, 123.82, 123.71, 121.72, 120.50, 1 19.80, 
119.43, 112.51, 110.67, 108.87; CIMS: m/z (MH') 

373.3, Anal. (C=H16N204"0.2H20): calcd: Q 70.28; H, 
4.40; N, 7.45. Found: C, 70.45; H, 4.58; N, 7.40%. 
2'-HydroxynaphthyIacyl-(2, 5-dihydroxy- 1 -naphthylme- 
thylene)hydrazide (4). A 74% yield; mp >280 °C; ~H 
NMR (DMSO-d6): 8 12.81 (s, 1 H), 12.23 (s, 1 H), 

11.29 (s, 1 H), 10.22 (s, 1 H), 9.52 (s, 1 H), 8.52 (s, 1 
H), 8.20 (d, 1 H, J=9.2 Hz), 7.59 (d, 1 H, J=8. 1 Hz), 
7.80 (d, 1 H, J=8.2 Hz), 7.71 (d, 1 H, J=8.6 Hz), 7.54 
(t, 1 H, J=7.5 Hz), 7.41 (m, 3 H), 7.16 (d, 1 H, J - 9 . 2 
Hz), 6.80 (d, 1 H, J=7.5 Hz); 13C NMR (DMSO-d6) 8 

163.04, 158.38, 154.10, 153.89, 148.09, 135.93, 133.39, 
130.52, 128.66, 128.52, 128.32, 127.01, 126.78, 125.84, 
123.85, 119.84, 118.93, 117.08, 111.40, 110.60, 108.29, 
106.46; CIMS: rn/z (MH +) 373.2. Anal. 
(C22Ht6N204"0.5H20): calcd: C, 69.28; H, 4.49; N, 7.35. 
Found: C, 69.35; H, 4.54; N, 7.27%. 

2',4'-Dihydroxyphenylacyl- (2-hydroxy-l-naphthylmethyl- 

ene)hydrazide (5). A 81% yield; mp >280°C; ~H 

NMR (DMSO-d6): 8 12.83 (s, 1 H), 12.18 (br s, 1 H), 

12.01 (br s, 1 H), 10.35 (br s, 1 H), 9.54 (s, 1 H), 8.32 

(d, 1 H, J = 8.6 Hz), 7.88 (m, 3 H), 7.62 (t, 1 H, J= 7.6 

Hz), 7.42 (t, 1 H, J = 7.4 Hz), 7.26 (d, 1 H, J=8.9 Hz), 

6.47 (d, 1 H, J=8.8 Hz), 6.43 (s, 1 H); ~3C NMR 



(DMS0-d6): 8 164.58, 162.94, 161.90, 158.02, 146.90, 
132.72, 131.67, 129.96, 128.94, 127.80, 127.71, 123.54, 
120.87, 178.92, 108.66, 107.78, 106.14, 102.93; CIMS: 
m/z (MH +) 323.1. Anal. (C18HI4N204): calcd: C, 67.08; 
H, 4.38; N, 8.69. Founch C, 67.04; H, 4.53; N, 8.91%. 
2' - Hydroxy-4'- (4-nitrobenzyloxy) phenyacyl- (2,4-di- 
hydroxy-l-naphthylmethyIene)hydrazide (6). A 91% 
.yield; mp >270 °C; 1H NMR (DMSO-d6): 8 12.82 (s, 1 
H), 12.46 (br s, 1 H), 1 1 .90 (br s, 1 H), 1 1 .03 (br s, 1 
H), 9.39 (s, 1 H), 8.25 (d, 1 H, J=8.1 Hz), 8.17 (d, 1 H, 
J=8.6 Hz), 8.10 (d, 1 H, J=8.3 Hz), 7.91 (d, 1 H, 
J=8.8 Hz), 7.71 (d, 1 H, J=8.2 Hz), 7.58 (t, 1 H, J 
= 7.6 Hz), 7.34 (t, 1 H, J= 7.5 Hz), 6.91 (d, 1 H, J= 8.8 
Hz), 6.61 (s, 1 H), 6.60 (s, 1 H), 5.33 (s, 2 H); ~3C 
NMR (DMSO-d6): 8 164.23, 162.47, 162.10, 160.37, 
157.62, 147.94, 147.07, 144.41, 132.92, 129.41, 128.27, 
128.27, 123.66, 123.66, 122.93, 122.50, 120.67, 120.28, 
107.75, 107.10, 102.39, 101.34, 100.46, 68.23; CIMS: m/z 
(MH +) 474.2. Anal. (C2sH19N307"2/3H20): calcd: C, 
61.85; H, 4.12; N, 8.66. Found: C, 61.77; H, 4.33; N, 
8.67%. 

Biological assays 

The previously published FACS assay protocol was 
followed for in vitro antimalarial testing. 24 Hypoxan- 
thine uptake and enzyme (cruzain, cathepsin B and 
papain) inhibition procedures were also published 
previously? 
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ABSTRACT The lack of an experimentally determined 
structure of a target protein frequently limits the application of 
structure-based drug design methods. In an effort to overcome 
this limitation, we have investigated the use of computer 
model-built structures for the identification of previously un- 
known inhibitors of enzymes from two major protease families, 
serine and cysteine proteases. We have successfully used our 
model-built structures to identify computationally and to con- 
firm experimentally the activity of nonpeptidic inhibitors di- 
rected against important enzymes in the schistosome [2-(4- 
methoxybenzoyl)-l-naphthoic acid, K, = 3 /xM] and malaria 
{oxalic bis[<2-hydroxy-l-naphthylmethylene)hydrazide], IC M 
= 6 fiM} parasite life cycles. 



Proteases are involved in many important biological pro- 
cesses including protein turnover, blood coagulation, com- 
plement activation (1), hormone processing (2), and cancer 
cell invasion (3). Thus, they are frequently chosen as targets 
for drug design and discovery. Noteworthy examples include 
the design of angiotensin-converting enzyme inhibitors for 
the treatment of hypertension (4) and programs to develop 
human immunodeficiency virus protease inhibitors to block 
proliferation of the AIDS virus (5). The critical role proteases 
play in the life cycle of parasitic organisms also makes them 
attractive drug-design targets for these infectious diseases 
(6). 

In the most simple terms, structure-based drug' design 
methods identify favorable and unfavorable interactions be- 
tween a potential inhibitor and target receptor and maximize 
the beneficial interactions to increase binding affinity. Ob- 
taining an accurate structure for the receptor or ligand- 
receptor complex is a logical step in this process. X-ray 
crystallography continues to be the source of high-resolution 
information about protein structures. However, considerable 
delays often exist between determining the sequence of a 
protein and solving its structure. Difficulties in protein ex- 
pression and more commonly in protein crystallization can 
delay x-ray structure determination. 

Currently, no general method exists for predicting tertiary 
structure from amino acid sequences. However, when a 
protein target is homologous to another protein or group of 
proteins of known structure, a sensible model structure can 
be proposed. Recent comparisons between model and crystal 
structures permit an assessment of the overall accuracy 
expected from homology model-built structures (7-9). For a 
sequence that is 80% identical to a protein of known struc- 
ture, the expected rms deviation of the core residues is =0.6 
A (10). The expected rms deviation increases to 1.8 A when 
the sequences are only 20% identical. However, model-built 
structures could still be useful in finding previously unknown 
lead compounds despite the uncertainties in the lower part of 
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this range if the errors cluster far away from the enzyme 
active site. 

The proteases targeted for inhibitor design in this study are 
important in establishing schistosome infection or necessary 
for the maintenance of malarial infection. Schistosomiasis is 
a snail-borne disease that is contracted by individuals who 
come into contact with the parasites in infested waters. 
Infectious larvae (cercariae) secrete an elastase to invade the 
skin of the human host and initiate infection. Once in the 
circulatory system, the schistosomes mature and reproduce. 
Thousands of eggs become trapped in the portal circulation 
of the liver, and the host immune response leads to portal 
hypertension. The protease that is implicated in skin pene- 
tration has been purified and characterized, and preliminary 
studies suggest that cutaneous application of an inhibitor of 
the cercarial elastase might prevent infection (11). 

The increased incidence of drug-resistant strains of malaria 
(especially Plasmodium falciparum) necessitates the search 
for new therapies. Malaria infection includes an erythrocytic 
phase that is responsible for all the clinical manifestations of 
the disease (12). During this phase, erythrocytic trophozoites 
degrade hemoglobin as a principal source of amino acids. 
Rosenthal and coworkers (13, 14) have identified a critical 
cysteine protease that appears to be involved in the degra- 
dation of hemoglobin, the parasites' primary source of amino 
acids. Blocking this enzyme with cysteine protease inhibitors 
[L-fra/is-epoxysuccinylleucylamido-(4-guanidino)butane 
(E64), benzyloxycarbonyl-Phe-Arg-fluoromethyl ketone] in 
culture arrests further growth and development (15). Thus, 
this enzyme is a promising target for new modes of antima- 
larial chemotherapy. 

METHODS 

Model Construction. Three-dimensional models of the 
structures of cercarial elastase and trophozoite cysteine 
protease were built following the approach of Blundell and 
coworkers (16, 17). Seven mammalian serine proteases, 
bovine chymotrypsin (18), porcine pancreatic elastase (19), 
rat mast cell protease (20), human neutrophil elastase (21), rat 
tonin (22), porcine kallikrein (23), and bovine trypsin (24), 
were used to derive a structural alignment for cercarial 
elastase (25). Papain (26) and actinidin (27) were used for 
trophozoite cysteine protease. The conformations of side 
chains were retained when possible, and the statistically most 
likely rotamer was selected when no conformational infor- 
mation was available (17). Loops were placed by using a 
combination of the loop dictionary and key residue ap- 
proaches (28, 29). The resulting models were refined by 
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energy minimization with the amber potential function (30). 
Models were validated with several computational strategies, 
including qpack to probe side-chain volume (31), the profile 
method of Luthy et al. (32), a Ramachandran map analysis of 
backbone geometry, and solvent-accessibility calculations 
(33). 

Screening the Fine Chemicals Directory Using DOCK3.0. The 
two protease model structures were used as receptors for 
ligand docking. DOCK3.0 is an automatic method to screen 
» small-molecule data bases for ligands that could bind to a 
given receptor (34). DOCK3.0 characterizes the grooves and 
invaginations of the active site with sets of overlapping 
spheres. The generated sphere centers constitute an irregular 
grid that can be matched with the atom centers of a potential 
ligand. The quality of fit of a ligand to the binding site is 
judged either by shape complementarity or by a simplified 
molecular mechanics force-field energy (estimated interac- 
tion energy). 

dock 3.0 was used to search the Fine Chemicals Directory 
(Molecular Design Limited, San Leandro, CA) of 55,313 
commercially available small molecules. The structures of 
the small molecules were obtained computationally by using 
a heuristic algorithm, concord, developed by R. Pearlman at 
the University of Texas. coNCORD-generated structures are 
estimated to be =90% in agreement with those structures 
optimized by molecular mechanics calculations (35). The 
Fine Chemicals Directory was chosen over the Cambridge 
Structural Database of experimentally determined structures 
because of the ease with which interesting compounds could 
be obtained. 

In a typical dock search, the top-scoring 100-200 mole- 
cules are examined with 10-50 of these selected for experi- 
mental testing (36). Because model protein structures were 
used instead of crystallographically determined structures, 
an arbitrarily large number of small molecules were saved. 
For each enzyme system, the 2200 molecules with the best 
shape-complementarity scores and the 2200 with the best 
force-field scores were saved. The resulting 8800 compounds 
were visually screened in the context of the active site by 
using the molecular display software midasplus (37). 

Because of the uncertainties inherent in model-built struc- 
tures, the scores generated by DOCK3.0 did not influence the 
visual screening process. Instead, compounds were judged 
solely on how they might interact with the active site in the 
putative ligand-receptor complex. In an effort to be self- 
consistent, the resulting 8800 compounds were screened 
three times. No compounds were selected during the first 
screening in an attempt to get acquainted with the systems. 
During the second and third passes, compounds that filled the 
site and had potential hydrogen-bonding and electrostatic 
interactions were selected for further inspection. Only com- 
pounds that were chosen on both the second and third 
screenings were considered further. From this list, an effort 
was made to choose compounds that were chemically diverse 
and that appeared to interact with the receptor in different 
ways. Fifty-two compounds were ultimately chosen for test- 
ing against the cercarial elastase, and 31 compounds were 
chosen for testing against the trophozoite cysteine protease. 
This screening process took =»1 week of effort. As the 
enzyme-active sites became more familiar with each succes- 
sive pass, the time needed to examine the ligand-receptor 
complex shortened. 

Of the 52 compounds selected for the cercarial elastase, 33 
compounds were from the force-field list, 10 compounds 
were from the shape list, and 9 compounds appeared on both 
lists. Of the 31 compounds selected for the malarial protease, 
20 compounds were from the shape list, and 11 compounds 
were from the force-field list. These compounds were ranked 
as high as 4th and as low as 1939th (out of 2200) by the scores 
generated by DOCK3.0. 
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K\ Determination for the Inhibitors Against Cercarial 
Elastase, Chymotrypsin, and Elastase. Cercarial elastase was 
purified as described (38). Initial reaction velocities were 
determined at room temperature for each enzyme by using 
tetrapeptide thiobenzyl ester substrates in the presence of 20 
lM 4,4'-dithiopyridine and following the absorbance at 324 
nM for 1 min after enzyme addition (39). Enzyme concentra- 
tions were determined by active-site titration with chloro- 
methyl ketone inhibitors, and used at l/100th of the lowest 
substrate concentration. The reaction buffer was 100 mM 
glycine-NaOH, pH 9.0/2 mM CaCl 2 . The specific substrates 
used were N-succinylalanylalanylprolylphenylalanyl thioben- 
zyl ester for cercarial elastase and chymotrypsin, and N-suc- 
cinylalanylalanylprolylalanyl thiobenzyl ester for pancreatic 
elastase at concentrations from 25 to 500 /x.M. Inhibitors were 
prepared as 100 mM stock solutions in dimethyl sulfoxide and 
used at concentrations from 0 to 100 jiM. Reaction velocities 
were determined in triplicate for each point and plotted by 
using the method of Dixon. Data were also plotted using the 
Hanes transformation of the Michaelis-Menten equation to 
ascertain the competitive nature of inhibition. Aj was deter- 
mined directly from the Dixon plot (40) and confirmed by 
replots of ATSf/Vep from the Hanes plot (41). 

The Trophozoite Cysteine Protease Inhibitor Studies. En- 
zyme activity was measured with the fluorogenic substrate 
benzyloxycarbonyl-Phe-Arg-(7-amino-4-methylcoumarin) as 
described (15). Trophozoite extracts were incubated with 
reaction buffer (in 0. 1 M sodium acetate/10 mM dithiothrei- 
tol, pH 5.5) and an appropriate concentration of inhibitor for 
30 min at room temperature. Benzyloxycarbonyl-Phe-Arg- 
(7-amino-4-methylcoumarin) (50 fiM final concentration) was 
then added, and fluorescence (380 nM excitation, 460 iiM 
absorbance) was measured continuously over 30 sec. The 
slope of fluorescence over time for each inhibitor concentra- 
tion was compared with that of controls in multiple assays, 

a 

(D 



(ii) 



b 




Fic. 1. (a) (i) Naphthol blue-black, (ii) 2-(4-Methoxybenzoyl)-l- 
naphthoic acid, (b) Oxalic bis[(2-hydroxy-l-naphthylmethylene)hy- 
drazide]. 
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Table 1. K\ values for compounds that inhibit cercarial elastase 
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Naphthol blue-black 
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200 /iM 


2-(4-Methoxybenzoyl)-l- 








naphthoic acid 


3 nM 


30 iM 


146 iM 



and the ICjo was determined from plots of percent control 
activity over inhibitor concentration. 

Effect of Oxalic Bis[(2-hydroxy-l-naphthylmethylene)hy- 
drazide] on [ 3 HJHypo xanthine Uptake as a Measure of Parasite 
Metabolism. [ 3 H]Hypoxanthine uptake was measured based 
on a modification of the method of Desjardins et al. (42). 
Microwell cultures of synchronized ring stage P. falciparum 
parasites were incubated with inhibitor in dimethyl sulfoxide 
(10% final concentration) for 4 hr. [ 3 H]Hypoxanthine was 
added (1 pCi per microwell culture; 1 Ci = 37 GBq), and the 
cultures were maintained for an additional 36 hr. The cells 
were then harvested and deposited onto glass-fiber filters that 
were washed and dried with ethanol. [ 3 H]Hypoxanthine 
uptake was quantitated by scintillation counting. The uptake 
at each inhibitor concentration was compared with that of 
controls, and the IC 50 value was determined from plots of 
percent control uptake over inhibitor concentration. 

RESULTS AND DISCUSSION 

Nonpeptidic inhibitors were identified for both the cercarial 
elastase and the malarial cysteine protease. Approximately 
10% of the compounds tested, 5 of 52 for the cercarial elastase 
and 4 of 31 for the malarial protease, displayed activity 
against the enzymes at concentrations <100 yM. Among 
these, three compounds were inhibitors at concentrations 
<10 yM (Fig. 1). 2-(4-Methoxybenzoyl)-l-naphthoic acid and 
naphthol blue-black inhibited the cercarial elastase with K\ 
values of 3 and6^M, respectively (Table 1 and Fig. 2). These 
two compounds also displayed specificity for the cercariaf 
elastase, as evidenced by the generally higher K\ values 
against chymotrypsin and pancreatic elastase (Table 1). Be- 
cause the SI specificity pocket of cercarial elastase is more 
similar to chymotrypsin than to pancreatic elastase, it is not 
surprising that both 2-(4-methoxybenzoyl)-l-naphthoic acid 




0 10 20 30 40 
Inhibitor concentration, /iM 



60 



Fig. 2. Representative ft determination using the Dixon plot. In 
this example, the ft is determined for naphthol blue-black against 
cercarial elastase. Each point was determined in triplicate. Each line 
represents a different substrate concentration (a, 500 /*M; ♦, 200 
lM ; □, 50 fiM). Some error bars are too small to be graphed on this 
plot. V, velocity. 



Proc. Natl. Acad. Sci. USA 90 (1993) 3585 

and naphthol blue-black are also good inhibitors of chymo- 
trypsin. (Note that the amino acid residues on the acyl side 
of the scissile bond are denoted PI, P2, . . . Pn, and those on 
the leaving group side of the scissile bond are denoted as PI', 
P2', . . . Pn'. The corresponding binding sites on the enzyme 
are SI, S2, . . . Sn and SI', S2\ . . . Sn'.) Presumably, the 
application of standard medicinal chemistry strategies to 
these lead compounds will yield more potent and selective 
inhibitors of the schistosome enzyme. Topical application of 
peptide-based inhibitors has already been demonstrated to 
block parasite migration through the skin (11). 

Oxalic bis[(2-hydroxy-l-naphthylmethylene)hydrazide] in- 
hibited the trophozoite cysteine protease with an ICjo of 6 /xM 
(Fig. 3a). When tested against cultured P. falciparum, this 
compound also inhibited the incorporation of hypoxan thine, 
a standard marker of parasite metabolism, at approximately 
the same concentration (Fig. 3b). Because this compound can 
inhibit the protease and the parasite, efforts are underway to 
synthesize analogs of oxalic bis[(2-hydroxy-l-naphthylmeth- 
ylene)hydrazide] and examine their therapeutic potential. 

The visual screening process was reexamined for the most 
active compounds in an attempt to find the relevant factors 
responsible for their selection. An interesting dichotomy was 
observed in the dock shape-based and force-field scores. All 




10" 7 10" a 10- 

Inhibltor concentration, M 



10" 



100 



BO- 



S' 60 
8 40H 
20- 



0 
10 



10- 



' 10~ 8 10-* 10- 
Inhibltor concentration, M 



10" 



Fio. 3. (a) IC50 curve for oxalic bis[(2-hydroxy-l-naphthylmeth- 
ylene)hydrazide] against malarial cysteine protease. The points are 
the means of eight assays, and the error bars are the SDs of the 
samples, (b) Inhibition of parasite uptake of [ 3 H]hypoxanthine by 
oxalic bis[(2-hydroxy-l-naphthylmethyIene)hydrazide]. 
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but one of the five inhibitors of the cercarial elastase were 
members of the force list with the following rankings: 85th, 
2-(4-methoxybenzoyl)-l-naphthoic acid; 122nd, plasmo- 
corinth B; 627th, naphthol blue-black; and 918th, a-pheneth- 
ylphthalamic acid. The fifth compound, 9-fluorenone-4- 
carboxylic acid, appeared on both lists, ranking 561st on the 
force-field list and 1783rd on the shape-based list. The two 
best cercarial elastase inhibitors, 2-(4-methoxybenzoyl)-l- 
Viaphthoic acid and naphthol blue-black, ranked 85th and 
627th, respectively, on the force-field list. By contrast, all 
four of the malarial protease inhibitors were members of the 
shape-based list, ranking as follows: 7th, 3,3'-diethyloxatri- 
carbocyanine iodide; 13th, oxalic bis[(2-hydroxy-l- 
naphthylmethylene)hydrazide]; 793rd, cephaloglycin; and 
1193rd, l-(2-methoxyphenyl)-6-(4-trifluoromethylphenyl)-5- 
thiobiurea. The best inhibitor, oxalic bis[(2-hydroxy-l- 
naphthylmethylene)hydrazide], ranked 13th. These results 
may reflect the environmental differences in the active site. 
The active site of the malarial protease consists of a large 
hydrophobic cleft. Because of the absence of charged resi- 
dues in the vicinity of the putative binding site, the shape- 
based scores for hydrophobic ligands that fill the site may 
adequately estimate the enthalpy of interaction between 
ligand and receptor. By contrast, the active site of the 
cercarial elastase contains both a hydrophobic SI pocket and 
charged amino acids in the vicinity of the active site. Con- 
sequently, the force-field scores, which include both van der 
Waals and electrostatic components, better estimate the 



interaction energy of the ligands with the active site of the 
cercarial elastase. 

The DOCK-generated enzyme-inhibitor complex structures 
for naphthol blue-black and oxalic bis[(2-hydroxy-l- 
naphthylmethylene)hydrazide] are shown in Fig. 4. Naphthol 
blue-black fits into the groove defined by the SI, S2, and S3 
subsites of the cercarial elastase. In the model complex, 
ligand binding is stabilized by the interaction of a phenyl 
group with the hydrophobic SI pocket. The sulfonic acid 
groups could hydrogen-bond with arginines in a nearby loop 
or possibly with the solvent, Similarly, oxalic bis[(2-hydroxy- 
l-naphthylmethylene)hydrazide] interacts with S2 and SI' 
sites of the malarial protease. The hydrophobic specificity 
site, S2, is filled by a naphthol group. The other naphthol 
group participates in a stacking interaction with the indole 
ring of Trp-177 at the SI' site. In addition, each hydroxyl 
group on the naphthol rings appears to hydrogen-bond to 
Ser-160 at S2 and Gln-19 at SI'. These complexes are useful 
starting points for modeling ligand-receptor interactions, but 
other possible binding modes should also be considered. 

At micromolar concentrations, it is likely that the inhibitors 
will have multiple modes of binding to the enzyme. Because 
these different binding modes are approximately isoenergetic, 
discriminating among the plausible alternatives with current 
scoring functions is difficult. Assumptions, such as rigid 
ligands and rigid receptors, are necessary for computational 
tractability but are also presumably responsible for the loss of 
resolution in these scores. The x-ray structures of thymidylate 
synthase complexed with two different inhibitors that were 





Fig. 4. (Upper) Stereo image of naphthol blue-black docked into the active site of cercarial elastase. {Lower) Stereo image of oxalic 
bis[(2-hydroxy-l-naphthylmethylene)hydrazide] docked into the active site of trophozoite cysteine protease. Catalytic residues are colored 
purple and labeled for orientation. The atoms on the inhibitors are color-coded: carbons are white, oxygens are red, nitrogens are blue, and sulfurs 
are cyan. 
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suggested by dock illustrate the challenges presented in 
accurately predicting the ligand-receptor complexes (43). For 
sulisobenzone, the failure to anticipate the binding of a coun- 
terion in the binding site led to an inaccurate prediction of the 
complex. In the case of phenolphthalein, a conformational 
change by an arginine between unbound and bound states of 
the enzyme and the presence of two waters in the bound state 
led to a slightly different conformation of the ligand than the 
one anticipated by dock (43). These examples highlight the 
importance of crystallography to the structure-based drug- 
design process. Ligand-induced conformational changes and 
the presence of bound waters and counterions are details that 
may be necessary for successful lead optimization. 

The quality of the model structure is directly related to the 
percentage sequence identity between the relevant sequences. 
The trophozoite cysteine protease is =33% identical to both 
papain and actinidin, and the cercarial elastase is 20-25% 
identical to the seven mammalian serine proteases of known 
structure. Thus, we anticipate errors of 1-3 A rms deviation in 
the model atomic coordinates, although errors in the vicinity 
of the active site are probably substantially smaller, reflecting 
selective sequence conservation. Two explanations of the 
success of our modeling/docking approach are plausible. (0 
The modeling errors in the active site are small, and the major 
determinants of molecular recognition are faithfully recreated. 
(«) Alternatively, the modeling process was irrelevant, and a 
homologous structure could have been substituted for com- 
putational ligand-binding studies to identify lead compounds. 
To address the latter possibility, two homologous serine 
proteases, chymotrypsin and trypsin, were used as receptors 
for ligand docking. Chymotrypsin was chosen because it 
shares with cercarial elastase a similar PI specificity for 
hydrophobic residues. Trypsin was chosen because its SI 
pocket is sterically similar, despite its different peptide spec- 
ificity. With the same method, docio.o was used to search the 
Fine Chemicals Directory, and the top 2200 shape- 
complementarity scoring compounds and the top 2200 force- 
field scoring compounds were saved. 

The best two inhibitors of the cercarial elastase were not * 
included in either list of 4400 compounds predicted to inhibit 
chymotrypsin or trypsin, although each shape-based 'list 
included one of the less effective inhibitors. Due to unfavor- 
able interactions seen in the model of 9-fluorenone-4- 
carboxylic acid docked to chymotrypsin (negative charge in 
hydrophobic SI pocket), this compound would have been 
rejected during the visual-screening evaluation. Conse- 
quently, none of the five inhibitors identified for the cercarial 
elastase would have been found in a DOCK3.0 search by using 
the chymotrypsin active site, and only one of the 100 fj.M 
inhibitors, a-phenethylphthalamic acid, would have been 
found by using the trypsin active site. Although we cannot 
rule out finding other low-micromolar inhibitors from the lists 
of compounds generated by the chymotrypsin and trypsin 
searches, our results indicate that the modeling process was 
not irrelevant and that this method for inhibitor discovery is 
sensitive enough to differentiate between similar active sites 
in homologous structures. 

Despite the inherent limitations of computer model-built 
structures, these structures are helpful in finding nonpeptidic 
inhibitors active at low-micromolar concentrations. Al- 
though these compounds are far from being drugs, they are 
sensible starting points for the process of drug development. 
Because these enzymes are members of two major protease 
families, our work suggests that computer models and struc- 
ture-based drug-design methods can be applied to identify 
inhibitors of proteases that are relevant to other pathophys- 
iologic processes. 
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Summary 

New empirical scoring functions have been developed to estimate the binding affinity of a given protein-ligand 
complex with known three-dimensional structure. These scoring functions include terms accounting for van der 
Waals interaction, hydrogen bonding, deformation penalty, and hydrophobic effect. A special feature is that three 
different algorithms have been implemented to calculate the hydrophobic effect term, which results in three parallel 
scoring functions. All three scoring functions are calibrated through multivariate regression analysis of a set of 200 
protein-ligand complexes and they reproduce the binding free energies of the entire training set with standard 
deviations of 2.2 kcal/mol, 2.1 kcal/mol, and 2.0 kcal/mol, respectively. These three scoring functions are further 
combined into a consensus scoring function, X-CSCORE. When tested on an independent set of 30 protein-ligand 
complexes, X-CSCORE is able to predict their binding free energies with a standard deviation of 2.2 kcal/mol. 
The potential application of X-CSCORE to molecular docking is also investigated. Our results show that this 
consensus scoring function improves the docking accuracy considerably when compared to the conventional force 
field computation used for molecular docking. 



Introduction 

Considerable advances in structure-based drug design 
have made a significant impact on drug discovery 
processes in the past decade [1-5]. By utilizing the 
essential structural properties of the target macromole- 
cule, a variety of methods now exist for suggesting 
potential ligand molecules either by screening large 
chemical databases [6-10] or by assembling molec- 
ular fragments inside the binding site [11-18]. These 
methods usually suggest a large number of molecules 
rapidly, far too many for organic synthesis and bio- 
logical experiments. Therefore, a structure-based drug 
design approach tends to arrive at the bottleneck where 
it is necessary to select only the most promising can- 
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didates for further experimental characterization. The 
basic assumption underlying structure-based drug de- 
sign is that a good ligand molecule should bind tightly 
to its target. Thus, it is extremely valuable to predict 
the binding affinity of a given ligand to its target and 
use it as a criterion for selection. This is known as 
the 'scoring problem' and has attracted great interests 
in developing methods for binding affinity calculation 
[19-21]. 

A large group of methods calculate binding affini- 
ties through force fields. In early years, attempts have 
been made to calculate the direct interactions, e.g. 
steric and electrostatic interactions, between a lig- 
and and its target molecule and relate the force field 
energies to binding affinities [22]. This method is 
still popular nowadays especially among molecular 
docking studies. However, as many researchers have 
pointed out, the interaction energy computed in this 
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way is only an approximation to the enthalpy change 
in the binding process, therefore the application of 
this method is usually restricted to the analysis of a 
congeneric series of ligands. Some researchers have 
supplemented standard force fields with an additional 
term to address the solvation effect with either PB/SA 
or GB/SA method [23]. More ambitious methods, 
snch as free energy perturbation [24] and linear re- 
sponse approximation [25, 26], try to consider solvent 
molecules explicitly and deal with ensemble averages. 
In theory these methods are expected to give more ac- 
curate predictions. However, in practice they do not 
always meet this expectation due to the deficiency in 
the force field as well as in the sampling procedure. 
In addition, these methods are still computationally 
expensive even for today's computers, which has lim- 
ited their popularity in structure-based drug design 
practice. 

Following the pioneering work of Bbhm [27], a 
number of so-called empirical scoring functions have 
emerged as an alternative [28-32]. These approaches 
assume that the overall receptor-ligand binding free 
energy can be decomposed into basic components, 
which can be written out conceptually as: 

AGbind = AG mo tion + AGj n t era ction 

+ AGdesoivation + AG CO nfiguration- 

Usually those factors which are known to be impor- 
tant for the binding process are included in the above 
function. Unlike force fields, empirical scoring func- 
tions are not derived from 'first principle'. Instead, 
they are directly calibrated with a set of protein-ligand 
complexes with experimentally determined structures 
and binding affinities through multivariate regression 
analysis. Empirical scoring functions have several ap- 
pealing features. Firstly, since they are calibrated with 
diverse protein-ligand complexes, their applications 
are not limited to a certain congeneric series of ligands 
or a particular target receptor. Secondly, each term 
in an empirical scoring function has a clear physical 
meaning. Studying the regression coefficients before 
each term sheds lights on the understanding of the 
receptor-ligand binding process. Thirdly, at a light- 
ning speed, the accuracy level (~2 kcal/mol) that a 
current empirical scoring function can achieve in bind- 
ing affinity prediction is acceptable for structure-based 
drug design approaches. In recent years, empirical 
scoring functions have become more and more pop- 
ular among structure-based drug design applications 
in which very accurate binding affinity predictions are 



not necessary, such as virtual database screening and 
de novo ligand generation. 

We have extensive experience in applying several 
empirical scoring functions, including Bohm's scor- 
ing function [27], ChemScore [30] and SCORE [32], 
to structure-based drug design projects. Despite of all 
the encouraging results we have obtained with these 
empirical scoring functions, it is clear that there is 
still plenty of room for improvement in terms of ac- 
curacy as well as robustness. In this paper, we will 
describe our work on further development and val- 
idation of empirical scoring functions. Firstly, we 
have derived three scoring functions, each of which 
has only five adjustable parameters. These scoring 
functions are calibrated with a diverse set of 200 
protein-ligand complexes, which is the largest one 
ever used by an empirical scoring function approach. 
Secondly, inspired by the consensus scoring strategy 
[33], we combine these three scoring functions into 
a consensus scoring function, X-CSCORE, to ensure 
converged results in binding affinity prediction. This 
consensus scoring function is tested on an independent 
set of 30 protein-ligand complexes. Thirdly, we have 
also explored the potential application of X-CSCORE 
to molecular docking. When compared to conven- 
tional force field computation, this consensus scoring 
function performs considerably better in identifying 
the experimentally determined protein-ligand complex 
structures. 

Methods and results 

Training set construction 

Developing an empirical scoring function requires a 
set of receptor-ligand complexes for calibration. Both 
the size and the quality of the training set will affect 
the final form of the scoring function. In our selection 
of receptor-ligand complexes, we used the follow- 
ing five criteria to ensure the quality of the training 
set. (1) Only protein-ligand complexes are considered. 
Complexes involving other types of receptors, such as 
nucleic acids, are not included. (2) The ligand mole- 
cule should be a 'normal' organic compound and bind 
to the receptor non-covalently. Therefore, complexes 
containing covalently bound ligands, complex ligands 
(such as Heme), or large ligands (MW > 1000) are 
excluded. (3) There should be no cofactor binding be- 
side the ligand. (4) Crystal structure of the complex 
with a resolution better than 3.0 A should be avail- 
able from the Protein Data Bank (PDB) [34]. Complex 
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structures solved by NMR techniques are currently 
not included in our selection. (5) The dissociation 
equilibrium constant (K, or Kd) of the complex has 
been determined experimentally and can be found in 
literature. Complexes with only IC50 values are not 
accepted. 

The resulting training set has 200 protein-ligand 
cfimplexes, which comprises more than 70 different 
types of proteins. Basically, this training set is an as- 
sembly of the training sets used by other empirical 
scoring functions [27-32] plus our own collections. 
The experimentally determined binding affinities are 
cited either from those previous approaches or the ref- 
erences listed in the relevant PDB files. All binding 
affinities are expressed in the negative logarithms of 
dissociation constants, i.e. pKd, for convenience. In 
this training set, the pKd values range from 1.48 to 
11.42, covering nearly 10 orders of magnitude. Here 
we neglect the potential inconsistence in the dissocia- 
tion constants related to experiment conditions, such 
as PH level, temperature, and salt concentration. A 
complete list of the training set can be found in the 
supplementary material section in this paper. 

Coordinates of the complex structure in the train- 
ing set are downloaded from PDB. No minimization 
is performed to further adjust the structure. For the 
convenience of processing, each complex structure is 
processed in SYBYL [35]. First, the ligand is extracted 
from the complex, assigned proper atom and bond 
types, and then written out as a separate file in the 
MOL2 format. The remaining part of the complex, 
i.e. the protein, is written out into another file in the 
PDB format. Metal ions located inside the binding site 
are left with the protein and treated as part of it. All 
crystallographic water molecules and other cofactors 
are removed. 

Scoring functions 

We assume that the overall free energy change in a 
protein-ligand binding process can be dissected into 
the following terms: 

AG b ind=AG v dw + AG H . bond 

+ AGdeformation + AGhydrophobic + A Go- (1) 

Here, AG V dw accounts for the van der Waals interac- 
tion between the ligand and the protein; AGjj . bond 
accounts for the hydrogen bonding between the lig- 
and and the protein; AGdefomation accounts for the 
deformation effect; AG hy drophobic accounts for the 
hydrophobic effect; AGq is the regression constant 



which implicitly includes the effects due to the trans- 
lational and rotational entropy loss in the binding 
process. Detailed algorithms for calculating each term 
will be described below. 

(1) Atom classification. Besides element type and 
hybridization state, both ligand and protein atoms need 
to be classified to compute some of the terms in our 
scoring functions. The atom types defined in our study 
are: (i) H-bond donor. Oxygen and nitrogen atoms 
bonded to hydrogen atom(s) and metal ions located 
inside the binding site of the protein, (ii) H-bond 
acceptor. Oxygen and sp 2 or sp hybridized nitrogen 
atoms with lone pair(s). (iii) H-bond donor/acceptor. 
Oxygen and nitrogen atoms which may act as either 
H-bond donor or H-bond acceptor, such as the oxygen 
atom in a hydroxyl group, (iv) Polar atom. Oxygen 
and nitrogen atoms that are neither H-bond donor 
nor H-bond acceptor, sulfur and phosphorus atoms, 
and carbon atoms bonded to hetero-atom(s). (v) Hy- 
drophobic atom. Carbon atoms that do not belong to 
the 'polar atom' group and halogen atoms. 

The following set of atomic radii are used in com- 
putation: carbon, 1.9 A; nitrogen, 1.8 A; oxygen, 
1.7 A; sulfur, 2.0 A; phosphorus, 2.1 A; fluorine, 
1.5 A; chlorine, 1.8 A; bromine, 2.0 A; iodine, 2.2 A; 
metals, 1.2 A. This radii set is applied to both ligands 
and proteins. 

(2) Van der Waals interaction. The van der Waals 
interaction is one of the essential non-covalent inter- 
actions. We employ the Lennard-Jones equation to 
reflect the balance between the short-range repulsion 
and the long-range attractive dispersion force: 

li gan d protein 

VDW=Y, E VDW U 



ligand protein 

= E E 



dij ) \ dij 



(2) 



Here VDW denotes for the van der Waals interaction 
energy, which is calculated by considering all the atom 
pairs between the ligand and the protein; d^ denotes 
for the distance between the ligand atom i and the 
protein atom j; djj.o = n + rj, i.e. the sum of 
van der Waals radius of atom i and 7*. Note that we 
use a 'softer' form in Equation 2 instead of the stan- 
dard 12-6 equation. Furthermore, in our algorithm, 
(i) only heavy atoms contribute. Hydrogen atoms are 
neglected, (ii) All heavy atoms are weighted equally. 
No weight factor is used to differentiate them, (iii) To 
avoid the huge repulsion raised by overlapped atom 




Figure I. Illustration of the three geometric parameters used in 
characterizing a hydrogen bond. 

pairs, we set an upper limit of 100 for VDW/y in Equa- 
tion 2. For any pair of atoms, if VDW 0 exceeds this 
limit, it will be cut flat to 100. 

(3) Hydrogen bonding. Hydrogen bonding is per- 
haps the most important factor for the specific binding 
of a ligand to its receptor. Such interaction happens 
when two atoms get close enough and form a donor- 
acceptor pair. The geometry of a hydrogen bond, 
D-H- • A, is typically described by the bond length, 
i.e. the distance between the hydrogen atom (H) and 
the acceptor (A), and the bond angle, i.e. ZDHA. 
However, hydrogen atoms are normally not revealed 
in X-ray crystallography analysis. Although hydro- 
gen atoms can be added later, energy minimization 
is usually required to set them into position. This 
practice could become problematic especially when 
the hydrogen atoms can take multiple low-energy 
positions, such as the one in a hydroxyl group. Fur- 
thermore, minimized structures will depend on force 
field parameters and they may be incompatible with 
the initial experimentally determined ones. Therefore, 
we choose not to consider hydrogen atoms explicitly. 
Here we introduce the concept of 'root' : the root of an 
atom is its non-hydrogen neighboring atom. When an 
atom bonds with more than one non-hydrogen atom, 
its root locates at the geometric center of all its non- 
hydrogen neighboring atoms. Let DR denotes for the 
donor's root and AR for the acceptor's root. In our al- 
gorithm, the geometry of a hydrogen bond is described 
by: (i) the distance (d) between D and A; (ii) the angle 
(60 between DR, D and A; and (iii) the angle (0 2 ) 
between D, A and AR (Figure 1). 

We assume that a hydrogen bond has an ideal 
geometry and any deviation from it will weaken the 
strength of the hydrogen bond. The strength of a hy- 
drogen bond is then computed by considering these 
three geometric descriptors: 

HB, 7 =f(dij)f(Quj)f(62.ij). (3) 

The distance function f(d) and the angular functions 
/(9i) and /(0 2 ) in Equation 3 are written in the 



following simple linear fuzzy forms: 

/W = 10 d Q < 4, -0.7 A 

= (1/0.7) x (do -d) d 0 - 0.7A < d < do 

= 0.0 d > d 0 

/(9i) = 1.0 9, > 120° 

= (1/60) x (0, - 60) 120° > 0, > 60° 

= 0.0 0, < 60° 

/(02) = 1.0 0 2 > 120° 

= (1/60) x (0 2 - 60) 120° > 0 2 > 60° 

= 0.0 0 2 < 60° 

Here d 0 = r, + r,, i.e. the van der Waals distance 
between the donor and the acceptor. These functions 
are derived from the analysis of all the potential hy- 
drogen bonding pairs presented in the training set. The 
observed distribution of the donor-acceptor distance 
(d) is shown in Figure 2a. In this figure, one can see 
that the peak value appears around 2.8A, which can 
be interpreted as the ideal length of a hydrogen bond. 
As d increases, the population decreases. But after d 
exceeds 3.4 ~ 3.5 A, it passes the bottom and begins 
to rise again, which can be interpreted as the turn- 
ing point from a hydrogen bond to a normal van der 
Waals contact. Therefore, it is reasonable to define that 
f(d) = 1.0 when d = 2.8 A while f(d) = 0.0 when 
d = 3.5 A. Considering the atomic radii of oxygen and 
nitrogen atoms, 2.8 A corresponds to d 0 - 0.7 A while 
3.5 A corresponds to d 0 , approximately. By assuming 
that the distance dependence of the strength of a hy- 
drogen bond is linear within this range, one will obtain 
the function listed above. The angular functions f(Q\) 
and /(0 2 ) are also derived similarly by interpreting the 
observed distributions of 0] and 9 2 from the training 
set (Figures 2b and 2c). 

The hydrogen bonding interaction between the lig- 
and and the protein is calculated by summing up all 
the hydrogen bonds: 

ligand protein 

HB = 5^ E ™ v (4) 

' i 

All types of hydrogen bonds, i.e. O-O, O-N, and N- 
N, are equally weighted so that no extra parameter is 
necessary. Special attention has been paid to the sat- 
uration in hydrogen bonding if one donor or acceptor 
atom contacts with multiple donor or acceptor atoms. 
For a given donor or acceptor atom, we define that (i) 
the maximal number of hydrogen bonds that a donor 
atom can form should not exceed the number of hy- 
drogen atoms on that donor atom; and (ii) the maximal 
number of hydrogen bonds that an acceptor atom can 
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Figure 2. Distribution of the three geometric parameters in a hy- 
drogen bond observed in the training set: (a) the donor-acceptor 
distance (in angstroms); (b) the DR-D-A angle (in degrees); (c) the 
D-A-AR angle (in degrees). 



form should not exceed the number of lone pairs on 
that acceptor atom. If an atom could be a donor and an 
acceptor at the same time, such as the oxygen atom in 
a hydroxyl group, both rules apply. 

As implied above, charged and neutral hydrogen 
bonds are not treated separately in our scoring func- 
tions since we find that the improvement of our scor- 
ing functions in the training set regression is totally 
negligible by separating them. 

In some cases, metal ions are found inside the bind- 
ing site of the protein. They may form coordinated 
bonds with atoms with lone pairs in the ligand and thus 
also contribute to the binding affinity. We include this 
kind of interaction in the hydrogen bonding term since 
it is the same as hydrogen bonding in nature, i.e. Lewis 
acid-base pair. Note that technically we define metal 
ions as 'donor' so that the metal-ligand coordinated 
bonds are calculated with exactly the same distance 
and angular functions of hydrogen bonding. 

(4) Deformation effect. Upon binding, both the 
ligand and the protein will be constrained in confor- 
mation as compared to their free states in solvent. This 
will raise adverse entropic changes, which is a nega- 
tive effect that must be overcome during the binding 
process. In other empirical scoring functions, the de- 
formation effect of the ligand is often estimated by 
counting the number of rotatable bonds (rotors) that 
become frozen during the binding process, assuming 
that each rotor is associated with a discrete number of 
low-energy conformations and thus a certain amount 
of conformational entropy. If there are more than one 
rotor in the ligand, their contributions are assumed to 
be additive. This assumption is reasonable when all 
the rotors are isolated and free to rotate, so the low- 
energy conformations associated with each rotor will 
multiply to build up the entire conformational space. 
However, when two or more rotors cross, apparently 
this assumption is no longer valid because now the 
rotation of any of them will interfere with the others 
and this will result in a reduction in the total num- 
ber of possible low-energy conformations (Figure 3). 
Therefore, simply counting the number of rotors often 
overestimates the conformational flexibility of certain 
kinds of molecules, such as oligo-peptides. 

In our algorithm, rotor is denned as acyclic sp 3 - 
sp 3 or sp 3 -sp 2 single bond between two non-hydrogen 
atoms. Terminal groups, such as -CH3, -NH2, -OH, 
and -X (X = F, CI, Br, I), whose rotation does not 
produce any new conformation of heavy atoms are not 
counted as rotors. The potential flexibility of cyclic 
portions of the ligand is ignored. The deformation 
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Figure 3. Illustration of 'isolated' (on the left) and 'crossed' rotors 
(on the right). 



effect of the ligand is then expressed as the contribu- 
tion of all the rotors with proper weight factors. For 
the convenience of computation, rotors are counted by 
summing the share of each ligand atom: 

ligand 

RT= £ RT,, (5) 

j 

where RT, = 0 if atom i is not involved in any rotor; 
RT, = 0.5 if atom i is involved in one rotor; and 
RT, = 1 .0 if atom i is involved in two rotors. How- 
ever, if atom i is involved in more than two rotors, 
then RT, = 0.5. Note that, according to the conven- 
tional rotor-counting algorithm, RT, should be 1.5 (in 
three rotors) or 2.0 (in four rotors) in this case. This 
reduction in the RT, value reflects our consideration 
for offsetting the overestimation of conformational 
flexibility in the conventional algorithm. Although 
very crude, we found that our algorithm improves the 
accuracy of our scoring functions. 

In our algorithm, the deformation effect of the 
protein is neglected. We have attempted to count the 
number of rotors presented in the side chains of the 
binding site residues and include it as a term in our 
scoring functions. However, such attempt did not im- 
prove the result. It is not surprising since the side 
chains of the binding site residues are generally immo- 
bilized even in the unbound state due to the stacking 
of neighboring residues. A more reasonable algorithm 
needs to be developed to account for the flexibility of 
the protein. 

(5) Hydrophobic effect. Binding of the ligand and 
the protein is accompanied by the desolvation process 
that undergoes changes in entropy as well as in en- 
thalpy. One of the results is that non-polar groups 
tend to favor each other, which is also referred to as 
'hydrophobic effect'. This effect is very difficult for 
accurate characterization since it involves complicated 
Iigand-water, protein-water, and water-water interac- 
tions before and after binding. Different algorithms 
have been used in other empirical scoring functions 



to calculate this term. We have implemented three 
representative algorithms in our scoring functions. 

(i) Hydrophobic surface algorithm. The hydropho- 
bic effect is assumed to be proportional to the buried 
hydrophobic surface of the ligand (Equation 6). This 
algorithm was adopted by Bohm's scoring function 
[27]. It should be pointed out that technically there are 
several types of molecular surfaces. Here we choose 
to use the solvent-accessible surface (SAS). 

ligand 

HS= ]T SAS,. (6) 

The radius of the solvent probe is set to 1 .5 A. The 
solvent-accessible surface of the ligand is represented 
by evenly distributed dots in a spacing of 0.5 A. Nu- 
merical integration is used to calculate the surface 
area. The surface areas of hydrogen atoms are at- 
tributed to their root atoms. Any part of the ligand 
surface is considered buried if it penetrates into the 
solvent-accessible surface of the protein. Note that 
only hydrophobic atoms are considered in Equation 6. 
The total amount of buried surface area is expressed in 
square Angstrom. 

(ii) Hydrophobic contact algorithm. The hy- 
drophobic effect is calculated by summing up the 
hydrophobic atom pairs formed between the ligand 
and the protein. This algorithm was adopted by Chem- 
Score [30]. In our algorithm, it is calculated as: 

li gan d protein 

HC = E E fM< < 7 > 

where 

/(</) = 1.0 d<d 0 + Q.5k 

= (1/1.5) x (do + 2.0 - d) d 0 + 0.5 A < d 

<d 0 + 2.0 A 
= 0.0 • d > d 0 + 2.0A. 

This distance function reflects the intuition that the 
strength of 'hydrophobic interaction' will reach the 
maximum when two hydrophobic atoms form van der 
Waals contact and diminish gradually with the in- 
crease in the inter-atomic distance. We find that this 
distance function needs to be fairly long-ranged in 
order to work well. 

(iii) Hydrophobic matching algorithm. This algo- 
rithm was adopted by SCORE [32]. According to this 
method, different parts of the ligand sense the pro- 
tein differently because of the heterogeneous nature 
of the binding site. If a hydrophobic ligand atom is 
placed at a hydrophobic site of the protein, then it is 
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expected to be favorable to the binding process. The 
overall hydrophobic matching between the ligand and 
the protein is calculated as: 

ligand 

HM= £ x HM' - W 

Here HM, is an indicator variable. It is set to 1 if a 
hydrophobic atom i is placed in a hydrophobic envi- 
ronment; otherwise it is set to 0. LogP,- refers to the 
hydrophobic scale of atom i, which is the contribution 
of atom i to the n-octanol/water partition coefficient 
(Log P) of the molecule. In our algorithm, the hy- 
drophobic scales for all kinds of atoms are cited from 
XLOGP2 [36]. They are introduced as weight factors 
here to ensure that more hydrophobic atoms contribute 
more to the hydrophobic effect. The 'environment' 
of a given ligand atom is defined to consist of all 
the atoms on the protein which are within 6 A from 
the ligand atom. The hydrophobicity of the environ- 
ment is determined by summing up the hydrophobic 
scales of all its member atoms. Our investigation of 
the training set shows that the average hydrophobicity 
of an environment surrounding a hydrophobic ligand 
atom is -0.50 log/" units. Therefore, in our algo- 
rithm an environment is defined as hydrophobic if its 
hydrophobicity is greater than —0.50 \ogP units. 

Finally, we summarize our scoring functions be- 
low. The binding affinity of a given protein-ligand 
complex, as expressed in pK d unit, is calculated by 
summing up all the terms described above. Since three 
different algorithms for modeling the hydrophobic ef- 
fect have been implemented, we have three resulting 
scoring functions: 

pK d j = Co.] 4-Cvdw.i x VDW 

+ c H-bond,l * 96 

+C r otor,l X RT 

+Chydrophobic.l X HS, (9) 
pK d ,2 = C 0 ,2 + CvDW.2 X VDW 

+CH-bond,2 x HB 
+C ro tor.2 X RT 

+Chydrophobic.2 x HC, (10) 
pK d j = Co.3 + Cvdw.3 x VDW 
+CH - bond,3 x ™ 
+C rot or.3 x RT 

+Chydrophobic,3 X HM. (11) 

It should be emphasized that, except for the hy- 
drophobic effect term, all the other terms in these 
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Figure 4. Correlation between the observed binding affinities of the 
200 protein-ligand complexes in the training set and the fitted values 
given by X-CSCORE (in pK d units). 

three scoring functions are calculated using identical 
algorithms. The consensus scoring function, which is 
named as X-CSCORE, is the arithmetical average of 
Equations 9-11: 

X - CSCORE = (pK dA + pK d , 2 + pK d .i) /3 (12) 



Regression analyses 

Coefficients before each term in Equations 9-11 are 
derived through standard least-square multivariate re- 
gression analyses of the training set. They are listed in 
Table 1 together with other related information. Cor- 
relation coefficients (r 2 ) and standard deviations (s) 
obtained from regression are listed in Table 2. The 
correlation between the observed binding affinities and 
the fitted values given by X-CSCORE is shown in Fig- 
ure 4. Leave-one-out cross-validations are performed 
to judge the quality of the regression models. The re- 
sulting q 1 and s pKSS are listed in Table 2. Both the 
regression and the cross-validation are performed with 
the QSAR module in SYBYL. 

Validation 

(1) Test set. An independent test set is usually needed 
to validate a regression model. When constructing the 
training set, we deliberately separate all the complexes 
released by the Protein Data Bank after 1998 from the 
others. These complexes, 30 in total, are used as a test 
set in our study. A complete list of the test set can 
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Table T. Regression models of Equations 9-1 1 



Term 



Coefficient 2 



Mean value 15 



Contribution 
fraction 0 



(Equation 9) 

VDW 

H-Bond 

Rotor 

Hydrophobic surface 
Constant 

(Equation 10) 

VDW 

H-Bond 

Rotor 

Hydrophobic contact 
Constant 

(Equation 11) 

VDW 

H-Bond 

Rotor 



-2.01 x 10 -3 (± 1.81 x 10 -3 ) 

0.307 (±0.137) 
-0.159 (±0.079) 

7.10xl0 -3 (±2.50 x 10 -3 ) 

2.69 (± 0.66) 



-0.96xl0~ 3 (± 1.91 xlO -3 ) 

0.412 (± 0.149) 
-0.100(± 0.074) 

3.73 x 10 -2 (± 1.12 x 10 -2 ) 
2.78 (± 0.65) 



-2.14 x 10~ 3 (± 1.65 x 10 -3 ) 

0.311 (± 0.131) 
-0.169 (± 0.078) 



Hydrophobic matching 0.602 (± 0. 1 59) 



-6.00 x 10 2 
4.21 
7.28 

2.74 x 10 2 A 2 



-6.00 x 10 2 

4.21 

7.28 
43.1 



-6.00 x 10 2 
4.21 
7.28 
2.51 



16.5% 
19.8 % 
25.3 % 
38.4% 



8.6% 
29.4% 
17.5% 
44.5% 



16.4% 
18.8% 
25.2% 
39.6% 



Constant 



3.10(± 0.65) 



a All coefficients are presented in pK d units. They can be converted into binding free energies at 298 K 
in kcal/mol by multiplying a factor of - 1.36. The values in brackets are 95% confidence intervals in 
regression. 

b Mean values of each term are calculated over the entire training set. 
Contribution fractions are calculated by using the QSAR/PLS module in SYBYL. 



Table 2. Statistical results of Equations 9-1 1 and X-CSCORE 





Equation 9 


Equation 10 


Equation 1 1 


X-CSCORE 


* 2 


0.504 


0.546 


0.571 


0.591 




1.58 


1.53 


1.43 


1.47 


f (4. 195) 


49.6 


58.7 


70.4 




Q 2 


0.480 


0.522 


0.551 




•Spress 


1.62 


1.57 


1.47 




"prcd 


0.318 


0.319 


0.249 


0.356 


■Spred 


1.51 


1.61 


1.63 


1.58 



a All the standard deviations, including 5, S preS s and 5 pre( j, are 
presented in pKj units. They can be converted into binding free 
energies at 298 K in kcal/mol by multiplying a factor of - 1.36. 



be found in the supplementary material section in this 
paper. 

All the scoring functions, including Equations 9- 
12, are used to predict the binding affinities of the 
30 protein-ligand complexes in the test set. The root- 



mean-squared deviation (s pre d) is used to measure the 
quality of prediction: 

fpred = {P K d,p re d ~ pK di0bs f / (N - 1). (13) 

The statistical results are shown in Table 2. The 
correlation between the experimentally observed bind- 
ing affinities and the predicted values given by X- 
CSCORE is shown in Figure 5. 

(2) Evolutionary regression. We have adopted an 
iterative regression procedure to further validate the 
internal consistency of our scoring functions, which 
was originally proposed in our previous work SCORE 
[32]. The central idea of this procedure, called evolu- 
tionary regression, is to test a given regression model 
with training sets of different sizes. In our study, 
this procedure starts from constructing a subset of 
50 complexes which are randomly selected from the 
training set without duplication. This subset is used 
to perform multivariate regression and leave-one-out 
cross-validation for the scoring function under inves- 
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Figure 5. Correlation between the observed binding affinities of the 
30 protein-ligand complexes in the test set and the predicted values 
given by X-CSCORE (in pK d units). 



tigation. All the regression results, including r 2 , s, q 2 , 
■Vess. an( * tne coefficients for each term in the scor- 
ing function, are recorded. This regression model is 
then used to predict the K4 values of the test set. The 
resulting r 2 ied and s pie< i are also recorded. Since the 
subset is constructed randomly, the entire procedure, 
i.e. construction of the subset, multivariate regression, 
cross-validation, and calculation of the test set, is re- 
peated for 10 times to reduce the noises in all the 
statistical results. Only the averaged results are used 
for analysis. At the next step, the size of the sub- 
set is increased by 10, and the regression model is 
re-evaluated with this new subset. This procedure is re- 
peated until the size of the subset reached the full size 
of the training set. We have performed evolutionary re- 
gression for Equations 9-1 1. The standard deviations 
observed during the evolutionary regression proce- 
dure of Equations 9-11 are shown in Figure 6a-c, 
respectively. 

(3) Molecular docking. We have also tested the 
performance of X-CSCORE in molecular docking ex- 
periments. We select 10 samples from the training set, 
including the L-arabinose binding protein/L-arabinose 
complex (PDB code 1ABE), the alcohol dehydroge- 
nase/CNAD complex (PDB code 1ADB), the adeno- 
sine deaminase/DAA complex (PDB code 1ADD), 
the cytidine deaminase/uridine complex (PDB code 
1AF2), the maltodextrin binding protein/maltose 
complex (PDB code 1ANF), the carboxypeptidase 
A/L-benzylsuccinate complex (PDB code 1CBX), 
the antibody DB3/progesterone analogue complex 
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Figure 6. Standard deviations (in pKj units) observed in the evo- 
lutionary regression procedure, (a) Equation 9; (b) Equation 10; (c) 
Equation 11. 
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Figure 7. Relationship between the RMSD values (A) and the bind- 
ing energies (kcal/mol) of 100 conformations of L-benzylsuccinate 
in complex with carboxypeptidase A (PDB code 1CBX). (a) Bind- 
ing energies calculated by AutoDock. (b) Binding energies calcu- 
lated by X-CSCORE. 
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Figure 8. Relationship between the RMSD values (A) and the bind- 
ing energies (kcal/mol) of 100 conformations of folate in complex 
with dihydrofolate reductase (PDB code 1DHF). (a) Binding ener- 
gies calculated by AutoDock. (b) Binding energies calculated by 
X-CSCORE. 



(PDB code 1DBM), the dihydrofolate reductase/folate 
complex (PDB code 1DHF), the glutathione S- 
transferase/glutathione complex (PDB code 1GST), 
and the HIV-1 protease/VX-478 complex (PDB code 
1HPV). The selection of these 10 samples emphasizes 
the diversity of the ligands and the proteins. For each 
complex, the AutoDock 3.0 program [8] is employed 
to perform a molecular docking run. In each case, 
the experimentally determined complex structure is al- 



ways used as the starting point. The ligand is treated 
flexible while the protein is kept rigid. The searching 
steps in the conformational sampling for translation, 
quaternion, and torsion are set to 0.5 A, 15° and 15°, 
respectively. Fifty thousand genetic algorithm genera- 
tions are run with a population of 100 conformations. 
The final 100 best-scored conformations are saved 
and their root-mean-squared deviations (RMSD), as 
calculated by using the observed bound conforma- 
tion as the reference, are recorded. Then the binding 
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Program description 

We have developed a program, X-CSCORE, to imple- 
ment the three scoring functions described by Equa- 
tions 9-11. Here 'CSCORE' means consensus scor- 
ing; while the prefix 'X' indicates that it is part of our 
in-house drug design toolkit X-TOOL. This program 
is written in ANSI C++ and has been tested on UNIX 
and LINUX platforms. The required inputs include the 
three-dimensional structure of the protein in PDB for- 
mat and the pre-docked ligand molecule(s) in MOL2 
format. The user is allowed to enable or disable any 
of the three scoring functions in computation and the 
final predicted binding affinities are based on the arith- 
metic average of all the enabled scoring functions. If 
all three scoring functions are enabled, typically this 
program is able to process around 10 000 ligand mole- 
cules for a given protein target in an hour on a SGI 
O2/R5000/1 80MHz workstation. 
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Figure 9. Relationship between the RMSD values (A) and the bind- 
ing energies (kcal/mol) of 100 conformations of 1 -deaza-adenosine 
in complex with adenosine deaminase (PDB code 1ADD). (a) 
Binding energies calculated by AutoDock. (b) Binding energies 
calculated by X-CSCORE. 

energies of these conformations are re-calculated by 
X-CSCORE. RMSD values of the best-scored confor- 
mations, picked by AutoDock and X-CSCORE, of all 
10 complexes are summarized in Table 3. For com- 
plex 1CBX, 1DHF, and 1ADD, RMSD values of the 
final 100 conformations are plotted against the bind- 
ing energies of these conformations in Figures 7-9, 
respectively. 



Discussion 

Accuracy and robustness 

As shown in Table 2, Equations 9-1 1 are able to re- 
produce the binding affinities of the entire training set 
with standard deviations (s) of 1.58, 1.53 and 1.43 pKj 
units, respectively. Their standard deviations in leave- 
one-out cross-validation (s P ress) are at the same level, 
which are 1.62, 1.57 and 1.47 pKd units, respectively. 
More importantly, these scoring functions perform al- 
most equally well for the independent test set: the 
standard deviations in the predicted binding affinities 
(jpred) are 1.51, 1.61 and 1.63 pKj units, respec- 
tively. These values correspond to 2.1-2.2 kcal/mol in 
binding free energy at room temperature. Considering 
the diversity of the test set, such accuracy in bind- 
ing affinity prediction is encouraging. Compared to 
other existing empirical scoring functions, our scoring 
functions have achieved better or comparable statisti- 
cal results. If taking the Fisher significant ratio (F) as 
an objective criterion for comparing different regres- 
sion models, the values are 32.1, 44.5 and 57.8 for 
Bohm's scoring function [27], ChemScore [30], and 
SCORE [32], respectively. In comparison, the F val- 
ues of our scoring functions are 49.6, 58.7 and 70.4 for 
Equations 9-11, respectively. 

When building a regression model, over-fitting of 
the regression equation should be avoided because it 
may fail to give reasonable predictions for samples 
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Table 3. Results from the molecular docking studies of 10 protein-ligand complexes 



PDB code 


Resolution 

(A) 


RMSD (A) a 








AutoDock 


X-CSCORE 


Exp. b 


X-CSCORE c 


1 A DC 


1 7 


U.Oi 




0.3£ 


5. 14 (3. ZD) 


1 ADD 


LA 


0 1A 


") 1A 




O./l (8.01) 


i a r\r\ 




*> Q1 
CMS 


U.OJ 


A "7/1 


5.0.5 (5.30) 


1AF2 


2.3 


0.88 


0.88 


3.10 


5.26 (4.90) 


1ANF 


1.67 


0.60 


0.54 


5.46 


6.16(6.03) 


1CBX 


2.0 


2.30 


0.77 


6.35 


5.74 (5.74) 


IDBM 


2.7 


1.31 


1.13 


9.44 


6.84 (6.65) 


1DHF 


2.3 


6.44 


1.24 


7.40 


6.34 (5.27) 


1GST 


2.2 


0.74 


1.21 


4.68 


5.92(5.21) 


1HPV 


1.9 


1.73 


1.16 


9.22 


6.47 (6.28) 



a RMSD value of the best scored conformation in reference to the observed bound 
conformation. 

b Experimentally determined binding affinity. 

'Calculated binding affinity for the best scored conformation. The values in brackets 
are the calculated one for the observed bound conformation. 



outside the training set. For this reason, only a min- 
imal number of adjustable parameters are included in 
our scoring functions to achieve maximal N/M ratio 
in regression analysis. For example, we do not as- 
sign additional weighting factors to different types of 
atoms when calculating the van der Waals interaction. 
When calculating the hydrogen bonding, we do not 
differentiate charged and neutral hydrogen bonds. No 
differentiation in aliphatic and aromatic atoms was 
made in calculating the hydrophobic effect. Besides 
the regression constant, there are only four coefficients 
in each of our scoring functions. As shown in TaWe 1, 
they are all significant in regression analysis. Here the 
van der Waals interaction term in Equation 10 seems 
to be an exception, which contributes only a relative 
small fraction. However, it is not surprising since the 
hydrophobic effect term in Equation 10 is also cal- 
culated by counting atom pairs, therefore it overlaps 
with the van der Waals term partially and 'grabs' some 
contributions from the van der Waals term. 

The N/M ratio issue deserves a little more dis- 
cussion. It is reasonable to expect that statistically 
converged results can only be obtained by using a large 
training set. But how large is large? What is the proper 
size of the training set for deriving an empirical scor- 
ing function like ours? To answer this question, we 
have adopted the evolutionary regression procedure to 
look for the answer. The idea of evolutionary regres- 
sion is to test a given regression model with training 
sets in different sizes and monitor the quality of the 
regression model during this procedure. Several trends 



can be seen in the evolutionary regression experiments 
of Equations 9-11 (Figure 6). (i) The standard devi- 
ation in the whole set fitting (s) gradually increases 
when the training set grows larger. This can be under- 
stood because the scoring function under regression is 
kept fixed during the whole procedure. A larger train- 
ing set represents more complexity and thus is more 
difficult to reconcile, (ii) The predictive ability of the 
regression model, as indicated by the standard devia- 
tions in leave-one-out cross-validation (s pTess ) and test 
set computation (i pr ed), is gradually improved when 
the training set grows larger. This indicates that a 
larger training set indeed helps our scoring functions 
achieve better predictive ability, (iii) When the training 
set is relatively small, the regression model is gen- 
erally unstable. The final regression model depends 
very sensitively on the contents of the training set, 
which may lead to chance correlation in regression and 
poor predictive ability. When the training set grows 
larger, the regression model becomes more stable and 
tends to converge to a certain level. As suggested by 
our evolutionary regression experiments, a training set 
containing at least 160 samples is required to derive a 
stable empirical scoring function with four terms, i.e. 
a minimal N/M ratio of 40. Unfortunately, the N/M 
ratios of other existing empirical scoring functions are 
generally much lower than this, e.g. LUDI (N/M = 
45/6 = 9) [27], ChemScore (N/M = 82/4 = 20) [30], 
and SCORE (N/M = 170/10 = 17) [32]. In our case, 
the N/M ratio is 200/4 = 50. Therefore, we believe our 
scoring functions are, if not much more accurate, more 
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robust in binding affinity prediction for a wider range 
of protein-ligand complexes. 

Consensus scoring 

A unique feature of our study is that three different al- 
gorithms have been implemented for modeling the hy- 
drophobic effect. As described in the Methods section, 
hydrophobic effect is calculated either by the buried 
solvent-accessible molecular surface (Equation 9), or 
by the number of hydrophobic contacts between the 
protein and the ligand (Equation 10), or by the hy- 
drophobic matching of the ligand with the binding site 
(Equation 11). All three algorithms are conceptually 
acceptable and actually they represent three typical al- 
gorithms adopted by empirical scoring functions for 
modeling the hydrophobic effect. However, it is not 
a good idea to include all three terms together in one 
scoring function since they account for the same effect 
and thus are highly correlated to each other. Therefore, 
they have to be accommodated in three scoring func- 
tions. As indicated by our regression results (Table 
2), all three scoring functions perform reasonably well 
and are basically comparable to each other. However, 
since these three algorithms utilize different geometric 
features of the given protein-ligand complex structure 
in computation, their results differ. We have found 
that, for 40.0% of the samples in the training set, the 
difference between the lowest and the highest calcu- 
lated binding affinity by these three scoring functions 
is less than 0.50 pKa units; for 40.5% of the samples, 
the difference is between 0.50 and 1.00 pKa units; 
while for the remaining 19.5% of the samples, the dif- 
ference is larger than 1 .00 pKj units. One can see that 
such difference is not trivial at all in many cases. Con- 
ceivably, if one can predict which scoring function will 
be the best for a given protein-ligand complex, the ac- 
curacy in binding affinity prediction will be improved 
greatly. Indeed, if the experimental values are corre- 
lated to the best fitted values (each of them is chosen 
from three hits), the standard deviation in the training 
set fitting will drop by half to about 0.7 pKj units. We 
have attempted to find out which scoring function may 
perform better for certain classes of ligands or families 
of proteins. Unfortunately this attempt ended without 
much success. 

Based on the fact that there is no reason to bias 
towards any one of the three scoring functions, we 
simply combine them together (Equation 12). This 
practice is consistent with the idea of consensus scor- 
ing which has been demonstrated to be an effective 



way of improving the hit-rates in virtual database 
screening [33]. As shown in Table 2, the perfor- 
mance of a single scoring function may vary and is 
not predictable. For example, among the three scoring 
functions, Equation 9 is the worst one for the train- 
ing set but the best one for the test set. In contrast, 
Equation 1 1 is the best one for the training set but the 
worst one for the test set. By averaging these scoring 
functions, i.e. X-CSCORE, the result is not always 
the closest one to the true value (in fact it is always 
between the best one and the worst one). However, 
the advantages are: (i) it provides a clear indication 
of what level of accuracy these three scoring functions 
can achieve. Obtaining a converged result in binding 
affinity prediction is certainly important for structure- 
based drug design practice; and (ii) large errors in 
binding affinity prediction can be reduced. Recently 
we have pointed out that the nature of consensus scor- 
ing is multiple sampling [37]. By applying multiple 
scoring functions in combination, the positive and the 
negative errors have a chance to cancel each other 
and that is why consensus scoring generally performs 
better than any single scoring procedure. 

Application to molecular docking 

Our scoring function is developed primarily for esti- 
mating the binding affinity of a given complex with 
known structure ('scoring'). We also expect it to be 
useful for identifying the correct 'pose' of a ligand 
to its receptor ('docking'). Although some disputes 
still exist in whether 'docking' or 'scoring' should use 
the same type of function, we believe that ideally a 
'scoring' function should also be able to serve as a 
'docking' function. This is very important because in 
practice 'docking' and 'scoring' are often inseparable, 
such as in a virtual database screening study. 

As described in the Methods section, we have in- 
vestigated the potential application of X-CSCORE in 
molecular docking with 10 samples. Since we have 
not implemented this consensus scoring function into 
any molecular docking program directly, we employ 
the AutoDock program as a tool to generate possi- 
ble bound conformations of the given ligand. All the 
conformations are then re-evaluated by X-CSCORE. 
RMSD values of the best scored conformations of 
these 10 protein-ligand complexes are listed in Ta- 
ble 3, where the results of X-CSCORE and the force 
field calculation in AutoDock are compared side by 
side. As one can see, if using the force field cal- 
culation in AutoDock as the scoring engine, 4 out 
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of the total 10 samples have RMSD values larger 
than 2.0 A; while if using X-CSCORE as the scoring 
engine, only one sample, i.e. the alcohol dehydro- 
genase/CNAD complex (PDB code 1ADB), shows 
a RMSD value larger than 2.0 A. In this case, we 
have checked all the 100 conformations generated by 
AutoDock and we found that the lowest RMSD value 
is* 2.74 A. This indicates that, with the parameters 
we were using, AutoDock has not generated any con- 
formation close enough to the observed one. In fact, 
X-CSCORE predicts a much higher pKj value of 
8.01 for the observed one. The RMSD versus en- 
ergy relationships observed in our docking tests for 
the Carboxypeptidase A/L-benzylsuccinate complex 
(PDB code 1CBX), the Dihydrofolate reductase/folate 
complex (PDB code 1DHF), and the adenosine deam- 
inase/DAA complex (PDB code 1ADD) are shown in 
Figures 7-9, respectively. For these three samples, the 
best RMSD values given by AutoDock are 2.30 A, 
6.44 A and 2.93 A; while the corresponding ones given 
by X-CSCORE are 0.77 A, 1.24 A and 0.83 A. It is 
very interesting to notice that, in the case of 1DHF, 
AutoDock has apparently chosen a wrong class of con- 
formations while the correct one is somehow scored 
about 2 kcal/mol higher. In contrast, X-CSCORE has 
no problem in identifying the correct conformation. 

It is very encouraging that our scoring functions 
are also applicable to molecular docking. Our scor- 
ing functions have all the necessary elements that 
correspond to the non-covalent interactions in a con- 
ventional force field, such as the van der Waals in- 
teraction and the electrostatic interaction (replaced by 
the hydrogen bonding term in our scoring functions). 
Besides that, our scoring functions also consider the 
hydrophobic effect and thus provide a better estima- 
tion of binding free energies. This is suggested in 
Figures 7b, 8b and 9b. In these cases, there is al- 
ways a clear correlation between the RMSD values 
of the conformations and their binding energies cal- 
culated by X-CSCORE. Generally, the smaller is the 
RMSD value, the lower is the binding energy. The im- 
portance of this feature should not be underestimated. 
Molecular docking is a conformational sampling pro- 
cedure which is performed on the potential energy 
surface defined by a certain scoring function. It is 
important that this potential energy surface does not 
contain a large number of false minima since such 
frustration will probably lead to poor convergence or 
wrong binding modes. The potential energy surface 
defined by an ideal scoring function should shape like 
a funnel, on which all the paths finally go down to 



the right position. As indicated by the RMSD ver- 
sus energy relationships shown in Figures 7b, 8b and 
9b, our consensus scoring function may have such an 
appealing feature. We expect that if a molecular dock- 
ing program adopts our consensus scoring function 
as its scoring engine, its accuracy and efficiency in 
finding the correct bound structure will be improved 
considerably. 

Considering that in practice our consensus scoring 
function will be applied in conjunction with molec- 
ular docking programs, it is highly desirable that all 
our scoring functions are able to tolerate at least a 
small amount of uncertainty in the input structure. For 
this reason, we have designed our scoring functions in 
such a way that they are not too sensitive to atomic 
coordinates. For example, we avoid the explicit use 
of hydrogen atoms in our algorithms. The reason is 
that predicting the position of a hydrogen atom pre- 
cisely could be problematic when the hydrogen atom 
is bonded to a terminal rotatable group, such as a 
hydroxyl group. This uncertainty will lead to large 
deviation if hydrogen atoms have to be included ex- 
plicitly in the calculation. Secondly, all the terms in 
our scoring functions are calculated with relatively 
large tolerances. For example, a 'softer' 8-4 equa- 
tion is adopted in the van der Waals interaction term; 
loose criteria for distance and angular dependence are 
adopted in the hydrogen bonding term; long-distance 
cutoff is adopted in the hydrophobic effect terms. All 
these efforts are dedicated to emphasize on the overall 
fitness of the ligand to the binding site rather than triv- 
ial structural details. As shown in Table 3, by applying 
X-CSCORE, if a conformation is close to the reference 
conformation, then indeed it will get a score close to 
the one of the reference conformation. 

Strength and weakness 

Our scoring functions are developed to provide fast 
binding affinity estimations for a wide range of pro- 
teins and ligands. As demonstrated by the training set 
and the test set, the average accuracy of our consen- 
sus scoring function in calculating absolute binding 
free energies is approximately 2 kcal/mol. This level 
of accuracy is acceptable for structure-based lead dis- 
covery in which very accurate prediction of binding 
free energies may not be necessary, such as virtual 
database screening or de novo structure generation. 
The speed of our consensus scoring function is also 
perfectly suitable for such approaches. 
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We have implemented our scoring functions in a 
user-friendly program and have already applied it to 
several on-going structure-based drug design projects 
in our group. In these projects, large chemical data- 
bases are screened first by a standard docking pro- 
gram, such as DOCK, to pick out the top 10% 
compounds. These compounds are then re-evaluated 
by X-CSCORE. The best compounds selected by 
X-CSCORE, usually less than 0.1% of the original 
database, are then tested in biological assays. Very 
promising compounds have been identified since the 
application of this approach. 

However, the accuracy of our consensus scoring 
function in binding affinity prediction is still not to- 
tally satisfactory: an error of 2 kcal/mol in binding 
free energy equals to approximately 50 folds in dis- 
sociation constant. Several drawbacks in our approach 
may have contributed to this inaccuracy. Firstly, since 
our scoring functions are derived from regression, they 
tend to characterize only the "common" interactions 
that are exhibited by a large population in the training 
set. Some other types of interactions, such as cation- 
tz interaction and tt-ti stacking, are not included in 
our scoring functions simply because they have rare 
occurrences and thus do not contribute much to the re- 
gression model. It is thus expected that a general-type 
scoring function like ours could fail to give reason- 
able predictions when these types of interactions are 
playing an important role in protein-ligand binding. 
Secondly, there are also some factors which are com- 
mon but we do not really have reasonable methods 
to take them into account. One example is the 'wa- 
ter molecules existing on the protein-ligand interface. 
Such water molecules are quite common and in some 
cases are thought to play an important role in the 
ligand binding. However, it remains unclear how to 
consider water molecules explicitly with an empiri- 
cal scoring function. If water molecules need to be 
considered explicitly, maybe the entire algorithm for 
modeling the so-called 'hydrophobic effect' needs be 
replaced as well. 

Our scoring functions also tend to give large posi- 
tive errors for complexes with very low affinities and 
large negative errors for complexes with very high 
affinities (Figure 4). This phenomenon contributes to 
the significant positive intercept (~ 3 pKj units) ob- 
served in all three scoring functions. Given the fact 
that most of the samples in the training set (80%) have 
pKj values between 3.00 and 9.00, our scoring func- 
tions are calibrated better for binding affinities at this 
range. In fact, if only the samples within this affinity 



range are chosen to derive our scoring functions, the 
standard deviations in regression will drop to 1.2-1.3 
pKd units (~ 1.7 kcal/mol in binding free energy). 

Another major problem is the quality of the train- 
ing set. Ideally, each protein-ligand complex in the 
training set should have a known high-resolution 
three-dimensional structure together with a reliably 
measured binding affinity value accessible to the pub- 
lic. Obtaining protein-ligand complex structures is not 
a problem since the Protein Data Bank provides an 
excellent resource for such information. However, col- 
lecting the binding affinities for these complexes is a 
tedious job since they all scatter in various literatures. 
So far, no appreciable database for such information 
has been established. The training set used in our 
study is a compilation of the training sets published in 
others' work plus our own collections from the litera- 
ture. Containing 200 samples, it is already the largest 
set published to date in an empirical scoring func- 
tion approach. As demonstrated in our evolutionary 
regression test, the size of this training set is suffi- 
cient for calibrating our scoring functions. However, 
the binding affinity data presented in this training set 
still need careful examination because a large portion 
of them are cited directly from others' work without 
further confirmation. Besides, some of the dissociation 
constants could have been measured under different 
experimental conditions, such as PH level, tempera- 
ture, and salt concentration. The uncertainties in the 
binding affinity data have certainly placed an intrinsic 
limit on the accuracy of our scoring functions. 

It should be mentioned that all the drawbacks we 
have discussed above are shared by other empirical 
scoring functions as well. Despite of all these draw- 
backs, empirical scoring functions remain a valuable 
and indispensable means for structure-based drug de- 
sign. Constructing a better training set will not be a 
problem in the future because more and more struc- 
tural and binding affinity data are becoming available. 
We are also optimistic that better algorithms will ap- 
pear to account for the binding process. All these 
efforts will lead to a substantial improvement in the 
performance of future empirical scoring functions. 

Conclusion 

We have developed a consensus empirical scoring 
function, X-CSCORE, for estimating the binding 
affinity of a given protein-ligand complex with a 
known three-dimensional structure. The framework 
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of our study is very similar to Bohm's pioneering 
work. However, we have presented our works on de- 
signing better algorithms for the contributing terms 
and calibrating the scoring functions against a larger 
training set. As shown in this paper, our consensus 
scoring function is able to predict the binding free 
energies with an average accuracy of approximately 
2"kcal/mol. Its potential application to molecular dock- 
ing is demonstrated with a number of protein-ligand 
complexes. When compared to the conventional force 
field calculation, X-CSCORE performs considerably 
better in identifying the correct bound conformations. 
Considering the reasonable accuracy, the wide ap- 
plicability, and the respectable speed, we expect that 
X-CSCORE will become a valuable tool for structure- 
based drug design. 

Supplementary material 

Tables of the training set (200 protein-ligand com- 
plexes) and the test set (30 protein-ligand complexes). 
The program, X-CSCORE, is available by contacting 
the authors. 
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Background: The trophozoite stage of the malaria parasite 
infects red blood cells. During this phase of their lifecyde, the 
parasites use hemoglobin as their principal source of amino 
.icids, using a .cysteine protease to degrade it We have previ- 
ously reported a three-dimensional model of this cysteine 
protease, based on the structures' of homologous proteases, 
and the use of the program DOCK to identify a ligand for the 
malaria protease. 

Results: Here we describe the design of improved ligands 
starting from this lead. Ligand design was based on the 
predicted configuration of the lead compound docked to the 
model three-dimensional structure of the protease. The lead 
compound has an IC 50 of 6 jiM, and our design/synthesis 



strategy has resulted in increasingly potent derivatives that 
block the ability of the parasites to infect and/or mature in 
red blood cells. The two best derivatives to date have IC^s of 
4S0nMandl50nM. 

Conclusions: A new class of anti-malaria) chemothcrapeutics 
has resulted from a computational search that was based on a 
model of the target protease. Despite the lack of a detailed 
experimental structure of the target enzyme or the 
enzyme-inhibitor complex, we have been able to identify 
compounds with increased potency. These compounds 
approach (he activity of cliloioquiiie (IC^ = 20 nM), but have 
a distinct meclunism of action. This series of compounds could 
thus lead to new therapies for chlnroquine-resistant malaria. 
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Introduction 

The World Health Organization estimates that 280 
million people are infected with malaria yearly [1]. 
Although various classes of anti- malarial agents exist, the 
most valuable are the quinoline-derived compounds, 
such as chloroquinc and mefloquine. Chloroquine has 
been an especially effective drug in prophylactic and 
therapeutic settings, and thus the appearance of malaria 
strains resistant to chloroquine, and more recently, to 
mefloquine, poses a serious threat to travelers and to 
people in countries where malaria is endemic. Reports 
of multi-drug resistant strains of these parasites make the 
search for novel therapies especially urgent [2], 

In this work, a novel target for anti-malarial chemo- 
therapy is pursued. During the erythrocytic stage of their 
lifecycle, malaria parasites infect circulating red blood 
cells (RBCs) and use hemoglobin as their principal 
source of amino acids. Rosenthal and coworkers have 
identified a cysteine protease from malaria parasites that 
mediates host hemoglobin degradation [3]. Blocking this 
enzyme with cysteine protease inhibitors, such as l-trans- 
epoxysuccinylleucyl amido-(4-guanidino)butane (E-64) 



and peptide fluoromethyl ketones, arrests parasite growth 
in vitro [4] and cures murine malaria (Plasmodium vinckei) 
infection in vivo [5]. These results support the therapeutic 
rationale behind using the protease as a target. Inhibiting 
this protease may, either by itself or in conjunction with 
established therapies, offer an alternative treatment for 
malaria. Most cysteine protease inhibitors, however, arc 
sulfhydryl reagents that covalendy modify the enzyme 
active site, and thus may inactivate some host cysteine 
proteases. Noncovalent angiotensin-converting enzyme 
inhibitors have been successful in the treatment of hyper- 
tension [6], and the clinical success of these compounds 
has motivated us to develop a new class of compounds 
that inhibit the malarial cysteine protease through non- 
covalent interactions. 



Results and discussion 

Initial lead discovery 

The discovery of a nonpeptidic lead compound, oxalic 
6«{(2-hydroxy-l-iiaphthylinethylcne)hydrazide) (ZLI48A), 
for the malaria cysteine protease has previously been 
described f7]. Briefly, a model structure for the malaria 
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Fig. 1. The active site of the malarial protease and putative 
binding mode of the original lead compound, ZLI48A. The 
protease is shown in white and the ligand in gray. Oxygen and 
nitrogen atoms near the active site are colored in red and dark 
blue, respectively. The S 2 , S,, and S,' binding subsites are shown 
in green, yellow, and cyan, respectively. 

cysteine protease was proposed using the X-ray structures 
of papain and actinidin as a basis for modeling. The 
model structure was then used to search the Fine 
Chemicals Directory of commercially available small 
molecules {the Fine Chemicals Directory distributed by 
Molecular Design is currently known as the Available 
Chemical Directory) for putative ligands using the 
docking program, DOCK 3.0 [8]. Thirty-one 
compounds were finally tested and ZLI48A was identi- 
fied as the best inhibitor of the protease. The IC^ value 
for enzyme inhibition against the substrate benzyloxy- 
carbonyl-Phe-Arg-(7-amino-4-methylcoumarin) was 
6 uM [7). More importantly, this compound inhibits the 
growth of parasites in culture, as judged by its ability to 
block hypoxanthine uptake by malaria parasites, with an 
apparent IC S0 value of 7 uM. 

Circulating RBCs are a major target for malaria 
infection, and the magnitude of the infection can be 
quantified by counting the fraction of infected RBCs on 
a chick blood smear. This process is sufficiently laborious 
that it precludes the evaluation of dose-response curves 
for potential therapeutics in an acceptable time frame. 
To overcome this problem, wc have developed a high- 
throughput assay that efficiently counts parasitized 
RBCs. Normally, circulating RBCj are enucleate and 
hence lack DNA. Parasitized RBCs contain malarial 
DNA. Thus. RBC staining by the DNA-binding dye 
propidium iodide can be exploited in an assay tliat uses a 



fluorescence-activated cell sorcer (FACS) to count the 
fraction of infected RBCs in a rapid and automated fash- 
ion. In this assay, ZLI48A has an IC 5(I value of 4.5 uM. 

Berger and Schechter [9] first demonstrated that papain 
like cysteine proteases contain active sites that can 
accommodate up to seven amino acids. For notations! 
convenience, the amino acid residues on the acyl side of 
thcscissile bond are denoted P,, P 2 . . . P n and those on 
the leaving group side are labeled P,', P 9 " . . . P '. The 
corresponding binding sites on the enzyme are S,, S, . . . 
S n and Sj', S 2 ' . . . S n '. The seven-residue binding pocket 
of papain-likc cysteine proteases involves the S,, through 
Sj' subsites. The S 2 and S ( subsites are those most 
responsible for the peptide cleavage specificity of this 
class of enzymes. The most stable conformer of ZLI48A 
is symmetric about its midpoint, with a second axis of 
pseudosymmetry about the backbone. Thus, there ate 
essentially two ways to orient the compound in the 
active site. In both cases, the compound lies across the 
active site cleft of the malaria cysteine protease with one 
naphthyl group fitting into the S 2 pocket and with the 
other stacking with the indole ring of Trpl77 in the S,' 
pocket. The presumed binding mode is shown in Fig. 1 
This orientation was chosen as the working model for 
the orientation in the complex, because it maximizes the 
compound's interaction with the enzyme, with each 
hydroxy! group within hydrogen bonding distance of ;i 
suitable residue in the enzyme (Serl60 and Gin 19), 
Alternatively, the compound could be rotated 180 
degrees about the horizontal axis of pseudosymmetry. In 
this orientation, the hydroxyl groups seem to interact 
with the enzyme less effectively but might interact wuii 
solvent water molecules. 

Lead optimization 

The starting point for lead optimization was the 
protease-ZL148A complex generated by the program 
DOCK 3.0 [10]. Kuntz and colleagues developed the 
DOCK algorithm to capture the static geometric features 
of a molecular recognition site. In the malarial protease 
work, the homology-based model of the enzyme provides 
the template, and the DOCK algorithm identifies a set of 
spheres with approximately atoin-sized radii to till the 
active site cleft. Frequendy, many overlapping spheres are 
used to fill the cleft, and a clustering algorithm is used to 
reduce the complexity of the. sphere representation. 
Preliminary binding modes for a compound are defined 
through attempts to match the centers of cleft spheres 
with the centers of atoms within the compound of" 
interest. Within DOCK 3.0, the quality of a compound's 
fit to the binding cleft can be evaluated based on its shape 
complementarity (contact score) or molecular mechanics 
interaction energy (AMBER force field score). When 
searching a database of compounds, DOCK 3.0 examines 
only the best orientation of the small molecule within the 
binding cleft (DOCK database screening mode). When a 
single compound is studied, multiple possihle binding 
modes can be examined (DOCK single mode). Of course, 
the initial orientation of the compound is dictated, in part. 
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by the irregular lattice of sphere centers identified origi- 
nally. To overcome some of the scoring distortion that this 
bias could impart, a rigid body minimization algorithm 
has been developed to move the ligand modest distances 
within the binding cleft and optimize the interaction 
energies. The use of a rigid body minimization algorithm 
reflects the trade-off becween the need for capid evaluation 
and the reality that many ligands are flexible. ZLI48A was 
selected based on its score for shape complementarity. 
Rigid body minimization was applied to the ligand posi- 
tioned in a preliminary orientation in the active site cleft 
so mar. other contributory factors (van der Waals and elec- 
trostatic interactions) required for efficient binding could 
be reflected in the enzyme-ligand complex (see Fig. 2). 

The energetics of the enzyme-ligand assembly depends 
heavily upon the detailed conformation of the complex. 
Unfortunately, errors in the precise geometry of the 
complex are likely given the computational origins of 
the model. Potential problems could result from the fact 
that gross conformational changes can occur between the 
complexed and uncomplcxed enzyme structures. In 
papain, the active site cleft widens upon binding a 
peptide-like ligand [11]. In the HIV protease, the flaps 
around the active site close on the substrate with 
backbone movements as large as 7 A [12,13). Ligands arc 
also able to alter their conformations upon binding to 
die enzyme. For example, the structures of methotrexate 
.done and complexed to dihydrofolate reductase are dra- 
matically different [14]. Consequently, choosing the 
relevant conformations of both enzyme and ligand can 
be critical for successful inhibitor design. Unfortunately, 
such structural changes are difficult to anticipate a priori 
and so our approach to ligand design is guided by, rather 
than dependent solely upon, the model of the 
ligand-enzy'me structure. 

Despite these uncertainties, a reasonable structure of the 
enzyme and/or enzyme-ligand complex has proven to 
be very useful. In this project, the malarial cysteine 
protease was modeled by homology to the known struc- 
tures of papain and actinidin. The malarial protease has 
33 % sequence identity with both enzymes. The antici- 
pated root mean-square errors average approximately 
1.5 A for the core of the molecule [15]. Fortunately, the 
errors within the active site should be smaller, as it is the 
most conserved portion of the structure. The ligand 
conformation was calculated using CONCORD, an 
algorithm that relies on heuristic rules for translating 
chemical connectivities into three-dimensional co- 
ordinates [16]. This could create errors in the models of 
the ligand conformation. Because of the uncertainties 
inherent in the models of both the enzyme and the 
ligands, sophisticated energy calculations such as free 
energy perturbation are not warranted. The inter- 
pretation of these results would be difficult at best. 

The model of the protease— ZLI48A complex has aryl 
groups filling the S ( ' and S 2 ' pockets, but the pocket is 
only partially filled, hi addition, although the conjugated 



eight-atom backbone of ZLI48A is most likely to be 
found in the all anti-conformation, syn isomers are 
possible that should not fit into the protease active site. 
We therefore designed analogs of ZL148A that have a less 
flexible backbone, Fdl the S 2 and S| or S 2 , S,, and S ( ' 
subsites, and effectively interact with the side chains lining 
the subsite specificity pockets. Because malaria is endemic 
principally in developing countries, useful anti-malarial 
agents must be inexpensive to produce. We thus favored 
analogs that are easily synthesized from relatively inexpen- 
sive starting materials in a small number of steps. The 
basic strategy for lead optimization is outlined in Fig, 3. 

Starting with the rigid-body minimized complex of 
enzyme-ZLI48A, and using the strategy highlighted 
above, we explored the S 2 , S,, and S,' subsites by varying 
the size of the aromatic rings, the linker lengths, and the 
substituents of the aromatic rings. A third aromatic ring 
system was also introduced (tri-aryl compounds) to study 
the concerted binding of the ligand molecule to the S 2 , S, 
and S,' subsites. Newly synthesized compounds were 
tested by FACS assay for their potency in inhibiting 




Ffg. 2. Optimization of the orientation of ZL148A using rigid 
body minimization. The initial positioning of the ligand in the 
active site cleft (red) was estimated to have an Interaction energy 
of 27.7 kcal mol-' using the AMBER force field. After 1 2 steps of 
minimization, the ligand (purple) moved 0.74 k and the energy 
was lowered to -35.3 kcal moH. Although the change in the 
position of the ligand was small, the difference between the 
energies of the starting and final conformations is significant. 
Green, yellow, and cyan colors highlight key residues in the S,, 
S,, and S,' binding subsites, respectively. 
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Fig. 3. Strategy for lead optimization. The fit of.the compound within the S 2 , S, and S,' subsites was explored by varying the size of the 
aromatic rings and the length of the linker, and further optimized by varying ring substltuents. A third aromatic system was also intro- 
duced to explore the concerted binding of the ligand to these subsites. Six compounds are shown in this chart to illustrate the strategy. 



parasite growth. The assay results of some representative 
compounds are shown in Fig. 4. 

From a study of analogs with varying linker lengths and 
ring sizes, we found that compounds with half the 
length of the original linker and a phenyl ring instead of, 
a naphthyl ring (compound ZLIII109A in Fig. 4) 
seemed to be twice as effective as the original lead, 
ZLI48A. A series of compounds was also made to test 
the role of ring substituents on inhibitor potency. It 
seems that hydroxyl groups it the 2, 2\ and 4' positions 
are required for effective inhibition. The most potent 
compound (ZLIV44A) has an IC^ value of 150 nM. 
The best tri-aryl inhibitor tested so far, ZLIV114A, has 
an IC 50 value of 450 nM. Overall, the four compounds 
listed in Fig. 4 have absolute IC S q values at or below 
1 (iM. It is clear from Fig. 4 that there is a progressive 
increase in inhibition potency following the design, 
synthesis, and testing paradigm. 

Modeling of ligand binding 

DOCK calculations (in DOCK single mode) of the 
binding of a series of bis aromatic compounds with one 
naphthyl ring, a short linker, and one smaller aromatic 
ring reveal some interesting insights. The results indicate 
that there are essentially two ways to orient an asymmet- 
ric Ligand. Both binding modes are similar to that for the 
symmetrical ZLI48A, shown in Fig. 1. In one binding 



mode (Fig. 5a), the larger naphthyl ring interacts with 
the deeper, well-defined S 2 specificity subsite, whereas 
the phenyl ring binds to the more accessible, but less 
well-defined, Sj' subsite. Alternatively, the ligand 
molecule can be rotated 180 degrees around the 
subsite (Fig. 5b). In this binding mode, the naphthyl ring 
now interacts with the S,' subsite, whereas the phenyl 
ring binds to the S t and S 2 subsites. Compounds with 
three aromatic components seem more Likely to adopt 
the binding mode shown in Fig. 5a. In this orientation, 
the larger naphthyl ring binds to the well-formed S 2 
subsite and the third ring now interacts fully with the 
more accessible, less well-defined S,' subsite. 

A recent report on the structures of inhibitors bound 
to the serine protease elastase, with K ; s as small as 
10 nM, shows that chemically similar inhibitors can 
adopt different binding modes and interact with 
different subsites (17], Several inhibitors were found to 
bind to the enzyme in an orientation opposite to that 
of natural substrate and other chemically similar 
inhibitors. The compounds with two-fold degeneracy 
that we have designed and tested may also bind to the 
malarial protease with more than one binding mode. 
Obviously, this could complicate the development 
of a structure— activity relationship. Further work, is 
under way to explore the availability of alternative 
binding modes. 
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In either binding mode, it seems that polar substituents 
on either aromatic ring can potentially foim hydrogen 
bonds with a number of side chains lining the binding 
subsitcs as well as with backbone oxygen and nitrogen 
atoms from some loop residues. This observation is con- 
sistent with our finding that the 2 and 2' hydroxyl groups 
arc required for efFective inhibition. Substitutions at 
other positions, such as 4 and 4", are also associated with 
increased potency. For example, calculation of the model 
of the ?rotease-ZLIHl09A complex suggests possible 
hydrogen Jionding interactions between residues Glnl9, 
Hisfi7. Cys25, Asnl33, Hisl59, Serl60, Trpl77 and the 
ligand oxygen and nitrogen atoms. Presumably, this 
explains part of the correlation between these ligand 
hydroxyl groups and compound potency. The modeling 
studies also provide a basis for the design of other tri-aryi 
compounds, a group of ligands that could lead to even 
more potent inhibitors via their extensive interactions 
with the protease subsites. 



Conclusions 

We report here the results of a structure-based approach 
for inhibitor design targeted toward a critical malarial 
protease. Starting with a lead compound identified 
by computer screening of a three-dimensional small 
molecule database, wc have designed molecules with 
significantly increased inhibition potency against malaria 
parasites. This method provides a practical strategy for 
future work on structure-based drug design even in the 
absence of a crystallographic structure of the target 
enzyme. Our approach not only emphasizes the 
structural rationale for inhibitor design, but also uses 
simple chemistry and commercially available starting 
materials for inhibitor synthesis. This approach permits 
us to screen a relatively large number of potential 
inhibitors rapidly and inexpensively. 



FiK- 4. Inhibition of the growth of 
maioria parasites in red blood cells by 
ZLI48A, the initial lead compound, and 
selected derivatives. 



The potency of this new class of compounds in preventing 
parasite growth in vitro begins to approach that of available 
traditional quinolinc-based drugs. As their target is distinct 
from that of chloroquirie, these compounds should be 
effective against chloroquine-resistant strains of the malaria 
parasite. The FACS assay we used in this report provides a 
practical and relevant method for evaluating inhibitor 
potency. The combination of structure -based screening 
and design coupled with simple chemical synthesis and a 
relevant biological assay has rapidly led to identification of 
a series of increasingly potent anti-malarial agents. 
Although therapeutic candidates will need to be active at 
low nanomolar concentrations, we anticipate that modi - 
fications of existing analogs will result in molecules that 
are suitable for preclinical efficacy and toxicology studies. 



Significance 

Malaria parasites, usually Plasmodium falciparum and 
Plasmodium vivax, infect 280 million people 
annually [1], and multi-drug resistant P. falciparum 
has become a significant pathogen in areas where 
the disease is endemic. We have chosen a cysteine 
protease that is central to the parasite's life cycle 
bat distinct from the target of cnloroquine action 
as a target for a structure-based drug development 
program. The choice of this target means that the 
chemotherapeutics that we develop should be 
active against chloroquine-resistant organisms. 

Structure-based drug design usually depends upon 
the experimental determination of the target 
structure by X-ray crystallography or NMR spec- 
troscopy. We have circumvented this step and 
relied exclusively on the sequence homology 
between the malaria enzyme and other cysteine 
proteases of known structure. A homoiogy-based 
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Fig. S. Two putative binding modes of a representative 6/s-aryl compound ZLIV44A. In (a) the larger naphthyl ring interacts with 
S 2 , whereas in (b) it interacts with the S, subsite. The protease is shown in white and the ligand in gray. Oxygen and nitrogen 
atoms near the active site are colored in red and dark blue, respectively. The S 2 , S,, and S,' binding subsites are shown in green, 
yellow, and cyan, respectively. 



model of the malaria enzyme was used at the 
template for a computer-based ligand docking cal- 
culation that identiGed a useful lead compound. 
Lead optimization was achieved by a combined 
approach of computational and synthetic analysis. 
Derivatives of the lead were first optimized for fit 
using the computer docking program, then synthe- 
sized and experimentally tested. 

The entire process of lead optimization was 
completed in eighteen months and resulted in the 
identification of compounds that are about ten 
times more effective than the initial lead and are 
simple and inexpensive to produce. This study 
shows that it is possible to start with a protein 
sequence and end with non-peptidic small 
molecule inhibitors of a medically relevant target 
enzyme. Thus, there may be many more systems 
amenable to structure-based drug development 
efforts than was previously believed. 



Materials and Methods 

Computer modeling 

The three-dimensional structures of potential inhibitors were 
constructed interactively using the molecular modeling program 
SYBYL and the CONCORD conversion algorithm (Tripos 



Associates, St Louis, MO). The Cambridge Crystallography 
Database was used to determine probable low-energy conforma- 
tions of certain ligand groups. Partial charges of ligands were 
calculated using the Gasteigcr-Marsili method. DOCK 3.0 was 
used in single ligand mode to aid inhibitor design. All plausible 
binding orientations of a single ligand that meet certain nser- 
defincd criteria (contact and force field scores) were obtained 
by DOCK. Rigid body minimization of DOCK-derived 
ligand-proteasc complexes were performed to optimize the inter - 
action energy between the protease and the ligand. This method 
provides a detailed profile of potential binding orientations of a 
ligand molecule in the active site of the malarial protease. Ml 
computer-assisted modeling and docking studies were carried out 
using a Silicon Graphics workstation IRIS4D/35 or Indigo2. All 
color figures were produced using the MidasPlus program from 
the Computer Graphics Laboratory, University of California, San 
Francisco (supported by NIH RR-0I081) [18,19]. 

Biological assay 

An assay based on FACS analysis was used to evaluate the 
potency of the compounds reviewed in this paper against 
parasite growth. Synchronized trophozoite-sage parasites were 
cultured in human blood at vaiious inhibitor concentrations. 
The parasites were allowed to mature, lyse the host cell and 
attempt invasion of fresh red blood cells. Using propicillin 
iodide to stain DNA. the FACS can discriminate between 
infected and uninfected cells and between stages of intra - 
erythrocytic parasite development, as only infected red blood 
cells contain DNA [20]. 
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because .ill of the clinical manifestations of malaria are caused 
by the erythrocytic cycle of lysis and reinfection, this assay is 
especially relevant for evaluating the efficacy of potential anti- 
malarial agents. The full dose inhibition curves for oxalic 
M(2-liydroxy-1-naphthylmechylene)hydrnzide) and several of 
the key derivatives are shown in Fig. 4. 

Chemical syntheses 

2-lIydroxy-l-naphthaldehyde, oxalic dihydrajtide, salicylic 
hydrazidc, methyl 2,4-dihydroxybenzoate and 4-nitrobcnzyl 
bromide were all purchased from Aldrich. 2.4-Dihydroxy- 
bf. n/.oic hydrteirlc was obtained from Trans World Chemicals. 
2.4-Dihydroxynaphthaidchydc was prepared from 1,3- 
ililiydroxytiaphthalenes (Aldrich) according to published 
procedures [21]. 

Genera! procedure for condensation of aldehyde with acyl- 
hydrazine: to a solution of the aldehyde (1 mmol) in 
methanol (20 ml) was added the corresponding acylhydrazine 
(I mmol) in one portion. The resulting mixture was heated at 
reflux for 3 h. In most cases, a precipitate was observed after 
10 rnin. The precipitate was filtered, washed with hot 
methanol (50 ml) and dried in vacuum (2 mm Hg). If needed, 
additional purification was performed by recrystallization 
using appropriate solvents. 

l-or the tri-aryl compounds, the acylhydrazine was made as 
follows: a mixture of benzyl bromide (10.0 mmol), methyl 
hydroxybenzoatc (10.0 mmol) and cesium carbonate 
(10.0 mmol) in acetone (80 ml) was heated at 56 °C for 18 h. 
This mixture was filtered, and the filtrate was concentrated to 
give the corresponding methyl phenylmethyleneoxybenzoatc. 
This crude methyl ester was then dissolved in EtOH (80 ml) 
and treated with hydrazine monohydrate (5.01 g, 100.0 mmol). 
The resulting mixture was stirred overnight at 20 °C and then 
concentrated to give the corresponding acylhydrazide, which 
was further puriBed by recrystailization from EtOH/HjO (7:3, 
v/v). The structures of these compounds were confirmed by 
spectroscopic methods. 
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Figure 9: Sequence Alignment of hIGF-lR. hIR and hIRR ectodomains. 

Derived by use of the PileUp program in the software package of the 
Genetics Computer Group. 575 Science Drive, Madison, Wisconsin, USA. 

Symbol Comparison cable: GenRunData : PiieUpPep. Or.p CompCheCk: 1254 

GapWeighc : 3.0 
GapLengthWeight : 0.1 



Name: Higflr Len: 
Name: Hir Len: 
Name: Hirr Len: 



972 CheCk: 1781 Weighs: 1.00 
972 CheCk: 2986 Weight: 1.00 
972 CheCk: 9819 Weight: 1-00 
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EGYLKILLIS K. . AEDYRSY 43 
EGHLQILLMF KTRPEDFRDL 49 
EGHLQILLMF TATGEDFRGL 45 



Higflr RFPKLTVITE YLLLFRVAGL E5LGDLFP ML T VIRGWKLFY NYALVI FEMT 93 
Hir SFPKL1M1TD YLLLFRVYGL ESLKDLFP NL T VIRGSRLFF NYALVI FEMV 99 
Hirr SFPRLTQVTD YLLLFRVYGL ESLRDLFPNL AVIRGTRLFL GYALVI FEMP 95 



Higflr NLKDIGLYNL RNITRGAIRI EKNADLCYLS TVDWSLILDA VSNNYIVGNX 143 
Hir HLKELGLYNL MHITRGSVRI EKNNELCYLA TIDWSRILDS VEDMYIVLMK 14 9 
Hirr HLRDVALPAL GAVLRGAVRV EKNQELCHLS TIDWGLLQPA PGANH1VGNK 14 5 



Higflr PPK.ECGDLC PGTMEEKPM. CEKTTINMEY NYRCWTTMRC QKMCPSTCGK 191 
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Hirr MACTARGECC HTECLGGCSQ PEDPRACVAC RHLYFQGACL WACPPGTYQY 243 



Higflr EGWRCVDRDF CANILSAES . . . . SDSEGFV IHDGECMQEC PSGFIRNGSQ 287 
Hir QDWRCVNFSF OQDLHHKCKN SRRQGCHQYV IHNNKCIPEC PSGYTMNSSN 298 
Hirr ESWRCVTAER CASLHSVPG RASTFG IKQGSCLAQC PSGFTRNSS. 287 



Higflr SMYCIPCEGP CPKVCEEEKK TKTIDSVTSA QMLQGCTIFK GNLLINIRRG 337 
Hir . LLCTPCLGP CPKVCHLLEG EKTIDSVTSA QELRGCTVI N GS LIINIRGG 347 
Hirr SIFCHKCEGL CPKECKV..G TKTIDSIQAA QDLVGCTHVE GSLILMLRQG 335 



Higflr NNIASSLENF MGLIEWTGY VKIRHSHALV SLSFLKNLRL ILGEEQLEGH 337 

Hir NNLAASLEAN LGLIEEISGY LKIRRSYALV SLSFFRXLRL IRGETLEIGN 397 

Hirr YNLEPQLQHS LGLVETITGF LKIKHSFALV SLGFFKNLKL I RGDAMVDGN 385 

Higflr YSFYVLDNQN LQQLWDWDHR NLTIKAGKMY FAFNPKLCVS EIYRMEEVTG 437 

Hir YSFYALDNQN LRQLWDWSKH NLTITQGKLF FHYNPKLCLS EIHKMEEVSG 4 47 

Hirr YTLYVLDNQN LQQLGSWVAA GLTIPVGKIY FAFNPRLCLE HIYRLEEVTG 435 
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Higflr 
Hir 
Kirr 

Higflr 
Hir 
Hirr 



• !End of 1-462 fragment 

TKGRQSKGDI NTRNNGERAS CESDV LHFTS TTTSKNRIII TWHRYRPPDY 487 

TKGRQERNDI ALKTNGDQAS CENEL LKFSY IRTSFDKILL RWEPYWPPDF 497 

TRGRQNKAEI NPRTNGDRAA CQTRT LRFVS NVTEAJDRILL RWERYEPLEA 4 85 

RDLISFTVYY KEAPFKNVTE YDGQDACGSN SWNMVDVDLP PNKDV 532 

RDLLGFMLFY KEAPYQNVTE FDGQDACGSN SWTWDIDPP LRSNDPKSQN 547 

RDLLSFIVYY KESPFQ NAT E HVGPDACGTQ SWNLLDVEIiP L SRTQ 530 



Higflr EPGILLHGLK PWTQYAVYVK AVTLTMVEND HIRGAKSEIL YIRTNASVPS 582 

Hir HPGWLMRGLK PWTQYAIFVK TL. VTFSDER RTYGAKSDII YVQTDATNPS 596 

Hirr EPGVTLASLK PWTQYAVFVR AITLTTEEDS PHQGAQSPIV YLRTLPAAPT 580 

Higflr IPLDVLSA5 N SS SQLIVKWN PPSLPNGNLS YYIVRWQRQP QDGYLYRHNY 632 

Hir VPLPPISVS N SS SQIILKWK PPSDPNGNIT HYLVFWERQA EDSELFELDY 646 

Hirr VPQDVISTS N SS SHLLVRWK PPTQRNGNLT YYLVLWQR1A EDGDLYLNDY 630 



Higflr CSKD.KIPIR KYADGTIDIE EVTENPKTEV CGGEKGPCCA C. . . PKTEAE 678 

Hir CLKGLKLPSR TWS.PPFESE DSQKHNQSE. YEDSAGECCS C. . . PKTDSQ 691 

Hirr CHRGLRLPTS N.NDPRFDGE DGDPEAEME SDCCP CQHPPPGQVL 673 

a >< P 

Higflr KQAEKEEAEY RKVFENFLHM SIFVPRPERK RRDVMQVA NT T MSSR5RNTT 728 

Hir ILKELEESSF RKTFEDYLHN WFVPRPSRK RRSLGDVGNVJTVAV? . . . TV 738 

Hirr PPLEAQEASF QKKFENFLHN AITIPISPWK VTSINKSPQR D.SGRHRRAA 722 



Higflr AA. . DTY NIT DPEELETEYP FFESRVDNKE RTVISNLRPF TLYRIDIHSC 776 
Hir AAFP NTS STS VPTSPEEHRP F..EKWNKE SLVISGLRHF TGYRIELQAC 786 
Hirr GPLRLGGNSS DFEIQEDKVB RE RAVLSGLRHF TEYRIDIHAC 764 



Higflr NKEAEKLGCS ASMFVFARTM PAEGADDIPG PVTWEPRPEN SIFLKWPEPE 826 
Hir NQDTPEERCS VAAYVSARTM PEAKADDIVG PVTHEIFENN WHLMWQEPK 836 
Hirr NHAAHTVGCS AATFVFARTM PHREADGIPG KVAWEASSKN SVLLRWLEPP 814 



Higflr NPNGLILMYE IKYGS.QVED Q RE CVS RQEY RKYGGAKLNR LNPGNYTARI 875 
Hir EPNGLIVLYE VSYRRYGDEE LHLCVSRKHF ALERGCRLRG LSPGNYSVRI 886 
Hirr OPMGLILKYE IKYRRLGEEA TVLCVSRLRY AKFGGVHLAL LPPGNYSARV 864 



Higflr QATSLSGNGS WTDPVFFYVQ AKTGYENFIH L 
Hir RAT S LAGNGS WTEPTYFYVT DYLDVPSMIA K 
Hirr RATS LAG NGS WTDSVAFYIL GPEEEDAGGL H 
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